It is known that any bipartite unitary operator of Schmidt rank three is equivalent to a controlled
unitary under local unitaries. We propose a standard form of such operators. Using the form
we improve the upper bound for the entanglement cost to implement such operators under local
operations and classical communication (LOCC), and provide a corresponding protocol. A part
of our protocol is based on a recursive-control protocol which is helpful for implementing other
unitary operators. We show that any bipartite permutation unitary of Schmidt rank three can
be implemented using LOCC and two ebits. We give two protocols for implementing bipartite
permutation unitaries of any Schmidt rank r, and show that one of the protocols uses O(r) ebits
of entanglement and O(r) bits of classical communication, while these two types of costs for the
other protocol scale as O(r log r) but the actual values are smaller for all r < 1100. Based on this
we obtain upper bounds on the number of nonlocal CNOT gates needed to implement bipartite
classical reversible maps using classical circuits under two different conditions. We also quantify the
entangling power of bipartite permutation unitaries of Schmidt rank two and three. We show that
they are respectively 1 ebit and some value between $\log_2 9 - 16/9$ and $\log_2 3$ ebits.

PACS numbers: 03.67.Ac, 03.67.Lx, 03.65.Ud, 03.67.Mn 


I. INTRODUCTION 

The implementation of unitary operations is a key
task in quantum information processing. Bipartite unitaries
are a particularly important class to study, because
they are the base case for studying multipartite unitaries.
Many tasks in quantum communication, games and cryptography
are restricted to two parties. The evaluation of
entanglement cost and/or classical resources for implementing
unitary operations belongs to a type of communication
cost problems in quantum information theory. It
has applications in the study of quantum networks and
distributed quantum computation; there has been recent
progress on implementing nonlocal unitaries or isometries
on multiple qubits, using shared entanglement in a
network or using a limited set of basic gates.

Any bipartite unitary is the product of controlled
unitaries. A controlled unitary can be implemented
with local operations and classical communication
(LOCC) and a maximally entangled state. The
entanglement cost scales with the logarithm of the number
of terms of control. The number can be as large as
the dimension of the controlling system. Bipartite unitaries
of Schmidt rank not greater than three are equivalent
to controlled unitaries under local unitaries.
Every Schmidt-rank-two bipartite unitary can be implemented
using one ebit and LOCC, but the best upper
bound for the entanglement cost of Schmidt-rank-three
unitaries appears to depend on the dimensions of the
Hilbert spaces: an upper bound on a $d_A \times d_B$ system is
$\log_2 \min\{d_A^2, d_B\}$ ebits for $d_A \le d_B$. In this paper
we show that all Schmidt-rank-three bipartite unitaries
can be implemented using $\log_2 \min\{d_A, d_B^2, 4\lfloor d_B/2 \rfloor + 2\}$
ebits, where A is the controlling side of the unitary. This
is presented in Theorem 10, based on a standard form
constructed for such unitaries. We present a protocol for
implementing some bipartite unitaries using multiple levels of
control, and apply it to Schmidt-rank-three unitaries.

Reducing the entanglement cost for implementing nonlocal
unitary gates is a key problem in computation or
communication tasks on networks, because entanglement
is often imperfect and costly to produce. A protocol that
uses less entanglement would have less error in the implemented
unitary gate, giving rise to less error in the final
outcome of the computation or communication task.
Some tasks may involve multipartite unitaries or nonunitary
operations, and studying the entanglement cost
of bipartite unitaries may help the study of the entanglement
cost of those operations. The classical communication
cost of the protocols in this paper is linear in the
entanglement cost. Thus our protocols have less classical
communication cost than the previous protocols. This
is beneficial since classical communication is subject to
noise and security concerns.

It is known that there is a dimension-independent upper
bound for the entanglement cost of bipartite permutation
unitaries with the help of a one-qubit ancilla on
one side Q. The ancilla can be dropped from this statement
at the cost of using more entanglement, since it can
be prepared from another shared entangled pair of qubits.
We construct a standard form of bipartite complex permutation
unitaries of Schmidt rank r, when a “big row”
of the unitary contains at least r - 1 nonzero blocks.
(The big row is defined in Sec. III.) We further investigate
the maximum number of distinct nonzero diagonal blocks
of a controlled permutation unitary of Schmidt rank r.
The above two results give upper bounds on the entanglement
cost for implementing the corresponding types of
unitaries. This is presented in Lemmas [T51 and [TTl When 
the Schmidt rank is not greater than four, we give tighter 
upper bounds of entanglement cost in Lemmas IlSI and ITfil 
and Corollary 1231 In particular, any Schmidt-rank-three 
bipartite permutation unitary needs only 2 ebits to implement.
We give a protocol that implements any bipartite
permutation unitary of Schmidt rank r using O(r log r)
ebits of entanglement and O(r log r) bits of classical
communication. Then we present another protocol for the
same task with the costs only scaling as O(r), but the
actual values are larger for all r < 1100, as discussed
below Theorem [211. These results give upper bounds for
the number of nonlocal CNOT gates for implementing
a bipartite classical reversible map using a classical circuit
under two different conditions. (A nonlocal CNOT
gate is a CNOT gate that acts across the two parties, as
opposed to acting locally on the bits within each party.)
The number is larger in the case that ancillas are required
to be restored to the initial value, compared to the opposite
case, and both results are under the assumption that
the initial values of the ancillas are known. These results
are an exponential improvement over the corresponding
results in [9]. An example of a Schmidt-rank-four permutation
unitary is given in Sec. V C with its entanglement
cost analyzed. As a byproduct, we point out that the
expression of bipartite complex permutation unitaries in
(EU is further evidence supporting a recent conjecture
on the ranks and marginals of multipartite states.

Classical reversible circuits may have lower energy cost
compared to the circuits that involve erasures [il|. The
current paper touches upon the topic of classical reversible
circuits, not only because our main result applies
to it, but also because we find that the design for the classical
reversible circuits could provide hints for designing better
quantum LOCC protocols or quantum unitary circuits.

The results so far are for the upper bound of entanglement
cost for implementing bipartite unitaries. Another
interesting topic is finding lower bounds for this
quantity, such as the entangling power defined in Eq. (27).
Any Schmidt-rank-r unitary can have entangling power
at most $\log_2 r$ ebits; see the beginning part of Sec. V D. In
the case of r = 3, it is much smaller than the upper bound
in this paper when $d_A$ and $d_B$ are large. Recently, Soeda
et al. proved that 1 ebit of entanglement is needed for
implementing any 2-qubit controlled unitary by LOCC
when the resource state is of Schmidt rank two. Stahlke
et al. proved that if the Schmidt rank of the resource
state is equal to the Schmidt rank of the bipartite unitary,
and the unitary can be implemented by the state
using LOCC or separable operations, then the resource
state has equal nonzero Schmidt coefficients. In Example [H]
we present a class of Schmidt-rank-three unitaries
for which we do not know of a protocol with constant
entanglement cost. In fact it is an open problem whether
there is a constant upper bound for the entanglement
cost of all Schmidt-rank-three bipartite unitaries.

Next, we show that the entangling power of any
Schmidt-rank-two bipartite permutation unitary is exactly
1 ebit (see Sec. V D). The counterpart for Schmidt-rank-three
permutation unitaries is some value between
$\log_2 9 - 16/9$ and $\log_2 3$ ebits, as shown in Proposition 27.
Again, there is a curious gap between the best known
entanglement cost and the entangling power, similar to
the case of general Schmidt-rank-three unitaries.

The rest of this paper is organized as follows. In Sec. II
we briefly summarize the technical results. In Sec. III we introduce
the notations and preliminary lemmas used in the
paper. In Sec. IV we present the main result on Schmidt-rank-three
bipartite unitary operators. In Sec. V we
study bipartite complex permutation unitaries. We first
present some preliminary lemmas, and then investigate
the entanglement cost of bipartite permutation unitaries
of Schmidt rank up to three in Sec. V A, and study the
protocol and entanglement cost for general bipartite permutation
unitaries in Sec. V B. An example is given in
Sec. V C, and the entangling power of bipartite permutation
unitaries is studied in Sec. V D. Finally we conclude
in Sec. VI.


II. SUMMARY OF TECHNICAL RESULTS 

To enhance readability we briefly summarize the results
of the current work and their relationships in this
section. We have introduced Theorem 10 in the introduction,
which reduces the entanglement cost to about
half of the previous upper bound in Q for large classes
of bipartite Schmidt-rank-three unitaries. To study this
theorem, we introduce Lemma Ej as a hard case among
the possible forms of bipartite unitaries of Schmidt rank
three. The proof of Theorem 10 makes use of Protocols 7
and 8, which are respectively a new two-level controlled
unitary protocol, and a protocol from Q for implementing
unitaries with group-type expansion.

We study some basic properties of the real or complex
bipartite permutation unitaries in terms of the Schmidt
rank in Lemmas |T3| and [TTl. The results are used throughout
Sec. V. In Lemmas M and m we investigate the
structure and entanglement cost for (complex) permutation
unitaries of Schmidt rank two or three. In Theorem [H]
we show that any bipartite permutation unitary
of Schmidt rank r can be implemented using local operations
with the help of min{log 2 (i?r-i-i) +?' + log 2 c, 8r - 8}
ebits of entanglement and twice as many bits of classical
communication, where $B_j$ is the Bell number defined before
Lemma m. The two terms in the result arise from
Protocol 18 and Protocol 21, respectively. This significantly
improves over the result in Theorem 22 of Q,
which states that such a unitary can be implemented using
LOCC with $3 \times 2^r$ ebits. In Theorem [Ml we adapt
the two methods of implementing bipartite permutation
unitaries in the proof of Theorem [52] to the decomposition
of classical bipartite reversible circuits into local
gates and nonlocal CNOT gates. In Proposition 27 we
prove that the entangling power [defined in Eq. (27)] of
bipartite permutation unitaries of Schmidt rank three is
in the range of $[\log_2 9 - 16/9, \log_2 3]$ ebits.


III. PRELIMINARIES 

In this section we introduce the notations and preliminary
lemmas used in the paper. Let $\sigma_x, \sigma_y, \sigma_z$ be the
usual $2 \times 2$ Pauli matrices. Denote the computational-basis
states of the bipartite Hilbert space $\mathcal{H} = \mathcal{H}_A \otimes \mathcal{H}_B$
by $|i, j\rangle$, $i = 1, \dots, d_A$, $j = 1, \dots, d_B$. Let $I_A$ and $I_B$
be the identity operators on the spaces $\mathcal{H}_A$ and $\mathcal{H}_B$,
respectively. We also denote $I_d$ and $0_d$, respectively, as
the identity and zero matrix of order d. The bipartite
unitary gate U acting on $\mathcal{H}$ has Schmidt rank n if there
is an expansion $U = \sum_{j=1}^{n} A_j \otimes B_j$ where the $d_A \times d_A$
matrices $A_1, \dots, A_n$ are linearly independent, and the
$d_B \times d_B$ matrices $B_1, \dots, B_n$ are also linearly independent.
An equivalent definition named as the operator-Schmidt
rank has been presented in the literature. The above
expansion is called the Schmidt decomposition. We name
the A (B) space of U as the space spanned by all $A_j$ ($B_j$)
that appear in a Schmidt decomposition of U. It is well
defined in the sense that the space is independent of the
specific choice of the Schmidt decomposition.
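As an illustrative sketch (not part of the original analysis), the Schmidt rank defined above can be computed numerically as the number of nonzero singular values of the "realigned" matrix, whose rows are indexed by the A indices and whose columns are indexed by the B indices. The function name `operator_schmidt_rank` and the tolerance `tol` are assumptions made for this example.

```python
import numpy as np

def operator_schmidt_rank(U, dA, dB, tol=1e-10):
    """Number of nonzero singular values of the realigned matrix of U."""
    # View U as a tensor T[a, b, a', b'] with row index (a, b) and column index (a', b').
    T = np.asarray(U).reshape(dA, dB, dA, dB)
    # Group the A indices (a, a') as rows and the B indices (b, b') as columns.
    M = T.transpose(0, 2, 1, 3).reshape(dA * dA, dB * dB)
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol))

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
# CNOT controlled from the A side: |0><0| (x) I + |1><1| (x) X.
CNOT = np.kron(np.diag([1.0, 0.0]), I2) + np.kron(np.diag([0.0, 1.0]), X)
# Two-qubit SWAP as a permutation matrix.
SWAP = np.eye(4)[[0, 2, 1, 3]]

print(operator_schmidt_rank(np.eye(4), 2, 2))  # 1
print(operator_schmidt_rank(CNOT, 2, 2))       # 2
print(operator_schmidt_rank(SWAP, 2, 2))       # 4
```

The CNOT gate is a controlled unitary of Schmidt rank two, while the two-qubit SWAP gate attains the maximal Schmidt rank four.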

Next, U is a controlled unitary gate, if U is equivalent
to $\sum_{j=1}^{d_A} |j\rangle\langle j| \otimes U_j$ or $\sum_{j=1}^{d_B} V_j \otimes |j\rangle\langle j|$ via local unitaries.
To be specific, U is a controlled unitary from the A or B
side, respectively. In particular, U is controlled in the
computational basis from the A side if $U = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes U_j$.
Bipartite unitary gates of Schmidt rank two or three are
equivalent to controlled unitaries via local unitaries
Q. We shall denote $V \oplus W$ as the ordinary direct sum of
two matrices V and W, and denote $V \oplus_B W$ as the direct
sum of V and W from the B side. The latter is called
the B-direct sum, and V and W respectively act on two
subspaces $\mathcal{H}_A \otimes \mathcal{H}_B'$ and $\mathcal{H}_A \otimes \mathcal{H}_B''$ such that $\mathcal{H}_B' \perp \mathcal{H}_B''$.
A permutation matrix (or called “permutation unitary”
or “real permutation matrix”) is a unitary matrix containing
elements 0 and 1 only. The partial permutation
matrix is a matrix with elements being 0 and 1 only, satisfying
that any row sum or column sum is not greater
than 1. So the partial permutation matrix may not be
unitary. A bipartite controlled-permutation matrix U is a
permutation matrix controlled in the computational basis
of one system, i.e., $U = \sum_j P_j \otimes V_j$ where the projectors
satisfy $P_j P_k = \delta_{jk} P_j$, $V_j$ is a permutation unitary, and each
$P_j \otimes V_j$ is a term of U. A complex permutation matrix is a
unitary matrix with exactly one nonzero element in each
row and column. A “big row” of the $d_A d_B \times d_A d_B$ unitary
matrix U refers to a $d_B \times d_A d_B$ submatrix given by ${}_A\langle j|U$,
for some $j \in \{1, \dots, d_A\}$. Similarly, a “big column” of
U refers to a $d_A d_B \times d_B$ submatrix given by $U|j\rangle_A$, for
some $j \in \{1, \dots, d_A\}$. A “block” of U refers to a $d_B \times d_B$
submatrix given by ${}_A\langle j|U|k\rangle_A$, for some $j, k \in \{1, \dots, d_A\}$,
and when j = k, the block is called a “diagonal block.”
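The block terminology can be made concrete with a short sketch. The helper names `block`, `big_row` and `nonzero_blocks_in_big_row` are hypothetical and used only for illustration; the code counts indices from 0, whereas the definitions above count from 1.

```python
import numpy as np

def block(U, dA, dB, j, k):
    """The dB x dB block <j|_A U |k>_A (indices counted from 0 here)."""
    return np.asarray(U).reshape(dA, dB, dA, dB)[j, :, k, :]

def big_row(U, dA, dB, j):
    """The dB x (dA*dB) submatrix <j|_A U."""
    return np.asarray(U)[j * dB:(j + 1) * dB, :]

def nonzero_blocks_in_big_row(U, dA, dB, j):
    """How many of the dA blocks in big row j are nonzero."""
    return sum(bool(np.any(block(U, dA, dB, j, k))) for k in range(dA))

X = np.array([[0.0, 1.0], [1.0, 0.0]])
# A controlled-permutation matrix (CNOT, controlled from A): diagonal blocks I and X.
CNOT = np.kron(np.diag([1.0, 0.0]), np.eye(2)) + np.kron(np.diag([0.0, 1.0]), X)
SWAP = np.eye(4)[[0, 2, 1, 3]]

print(block(CNOT, 2, 2, 1, 1))                     # the diagonal block X
print(nonzero_blocks_in_big_row(CNOT, 2, 2, 0))    # 1
print(nonzero_blocks_in_big_row(SWAP, 2, 2, 0))    # 2
```

For a unitary controlled in the computational basis from the A side, only the diagonal blocks are nonzero, so every big row contains exactly one nonzero block.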

In all the protocols in this paper, the computational
basis starts from $|0\rangle$ instead of $|1\rangle$. For an n-dimensional
system, we respectively define the Fourier gate
$F = \frac{1}{\sqrt{n}} \sum_{j,k=0}^{n-1} e^{2\pi i jk/n} |j\rangle\langle k|$ and the Z gate usually as
$Z = \sum_{j=0}^{n-1} e^{2\pi i j/n} |j\rangle\langle j|$, but sometimes generalizing the $|j\rangle\langle j|$
to a high-rank projector, see Protocol 4. The Z basis is
the computational basis. The Z-information means the
information about which computational basis state
the quantum system is in.

In this paper, the “entanglement cost” of a bipartite
unitary U is defined as

$$E_c(U) = \inf_p E_c(p), \tag{1}$$

where p is any one-shot exact deterministic LOCC protocol
to implement U, and $E_c(p)$ is the amount of initial
entanglement needed in the protocol. “One-shot”
means that only one copy of the unitary is implemented,
while the word “exact” excludes the case that some other
unitary that might approximate the given unitary is implemented,
and “deterministic” means that the unitary
is implemented with no chance of failure. The Schmidt
rank of the initially entangled state and the dimension of the
ancillary space are finite in each protocol p, and there is no
constant upper bound for these quantities. In the case
that the resource entangled state is mixed, we suggest
using the entanglement of formation as the entanglement
measure, although we do not discuss the mixed
entangled state in this paper. If there is entanglement
left after the protocol, subtraction of the latter from the
cost would lead to definitions of assisted entanglement
cost. It is beyond the scope of this paper.

The unit for entanglement is “ebit.” The entanglement
contained in a maximally entangled pure state of Schmidt
rank N is regarded as $\log_2 N$ ebits. Also, to simplify the
notation, every bit of classical communication used in a
protocol is called a “c-bit.” If the classical message is a
signal among N equally possible signals, the amount of
classical communication is regarded as $\log_2 N$ c-bits.


A. Linear algebra 

Here we present a few preliminary results of linear algebra
used throughout our paper.

Lemma 1 Let D be a diagonal unitary matrix. The following
four statements are equivalent.

(i) D has at least three distinct eigenvalues;

(ii) the identity, D and $D^\dagger$ are linearly independent;

(iii) any unitary in the linear span of the identity and D
is proportional to one of them;

(iv) any multiple of a unitary in the linear span of the identity
and D is proportional to one of them.

Proof. (i) → (ii). Let x, y, z be three distinct
eigenvalues of D. Since x, y, z all have modulus one, the matrix

$$F = \begin{pmatrix} 1 & x & x^* \\ 1 & y & y^* \\ 1 & z & z^* \end{pmatrix}$$

is the product of the diagonal
matrix $\mathrm{diag}(x^*, y^*, z^*)$ and a Vandermonde matrix with
columns permuted; the latter has determinant $(y-x)(z-x)(z-y)$.
Since x, y, z are distinct, F is invertible. Since
F is a submatrix of the matrix whose columns are the
diagonal vectors of the identity, D and $D^\dagger$, the latter are
linearly independent. We have proved (i) → (ii).

(i) → (iii). Let the unitary be U = xI + yD where
x, y are complex numbers. We have $(xI + yD)(x^*I + y^*D^\dagger) = I$,
hence $xy^*D^\dagger + x^*yD = (1 - |x|^2 - |y|^2)I$.
By (ii), which holds because of (i) → (ii), the identity, D and $D^\dagger$
are linearly independent, so $xy^* = 0$, i.e., x = 0 or y = 0.
Hence U is proportional to D or to the identity, which proves (i) → (iii).

Finally the relations (ii) → (i), (iii) → (i) and (iii) ↔
(iv) are trivial. This completes the proof. □
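The step (i) → (ii) can be checked numerically. The following is only a sanity check under the assumption that D is specified by the phases of its diagonal entries; the names `span_rank` and `F_matrix` are illustrative and do not appear in the proof.

```python
import numpy as np

def span_rank(phases):
    """Rank of span{I, D, D^dagger} for D = diag(exp(i * phase))."""
    d = np.exp(1j * np.asarray(phases, dtype=float))
    # Columns: diagonal vectors of the identity, D and D^dagger.
    cols = np.stack([np.ones_like(d), d, d.conj()], axis=1)
    return int(np.linalg.matrix_rank(cols))

def F_matrix(x, y, z):
    """The 3 x 3 matrix with rows (1, t, t*) for t = x, y, z."""
    return np.array([[1, t, np.conj(t)] for t in (x, y, z)])

print(span_rank([0.0, 1.0, 2.0]))        # three distinct eigenvalues -> 3
print(span_rank([0.0, 1.0, 1.0, 0.0]))   # only two distinct eigenvalues -> 2

# Determinant factorization used in the proof (valid when |x| = |y| = |z| = 1).
x, y, z = np.exp(1j * np.array([0.3, 1.1, 2.5]))
lhs = np.linalg.det(F_matrix(x, y, z))
rhs = np.conj(x * y * z) * (y - x) * (z - x) * (z - y)
print(np.isclose(lhs, rhs))              # True
```

With only two distinct eigenvalues the three diagonal vectors span a two-dimensional space, which is why statement (i) requires at least three.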


In the following lemma, a matrix A is said to be “block
diagonal” iff there is a permutation matrix P such that
$PAP^{-1} = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}$, where $A_1$ and $A_2$ are square matrices.
We regard a $k \times k$ matrix as being of order k.


Lemma 2 Suppose U is a unitary matrix of order at
least two, and there is a nonzero diagonal matrix D such
that there is a nontrivial linear combination of D and
$\tilde{U} = \begin{pmatrix} 0 & U \\ U^\dagger & 0 \end{pmatrix}$ that is unitary, and we denote it as V.
Then $X^\dagger V X$ is block diagonal, where $X = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$
and W is an $n \times n$ unitary matrix.

Proof. By assumption, for the given $n \times n$ unitary
matrix U, where $n \ge 2$, there exists a nonzero complex
number c and a nonzero diagonal matrix D such that
$V := cD + \tilde{U}$ is proportional to a unitary matrix of order
2n, where $\tilde{U} = \begin{pmatrix} 0 & U \\ U^\dagger & 0 \end{pmatrix}$. This V differs
from the V in the assertion by a constant factor, hence it
suffices to prove the assertion for the current V. Suppose
$cD = \mathrm{diag}(x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_n)$, and the matrix
elements of U are $(U)_{ij} = U_{ij}$, $i, j \in \{1, \dots, n\}$. The
rows of V are mutually orthogonal. From that the j'th
and (n + k)'th rows of V are orthogonal, where $j, k \in \{1, \dots, n\}$,
we have $x_j U_{jk} + y_k^* U_{jk} = 0$, hence

$$y_k = -x_j^* \quad \text{if } U_{jk} \ne 0, \quad \forall j, k. \tag{2}$$

Therefore, for any $j \in \{1, \dots, n\}$, it must be that those
$x_p$ ($1 \le p \le n$) that are equal to $x_j$ and those $y_q$
($1 \le q \le n$) that are equal to $-x_j^*$ satisfy that their row
and column coordinates determine a rectangular block in
U consisting of elements $U_{pq}$, and any element of U outside
of this block that is in the same rows or the same
columns of this block must be zero. The last statement is
due to the following reason: Suppose such a rectangular
block contains $U_{pq}$, then an element $U_{pq'}$ where q' satisfies
$y_{q'} \ne -x_p^*$ is in the row labeled by p and outside of the
rectangular block containing $U_{pq}$; and from (2), we have
$U_{pq'} = 0$. Now we consider two cases:

The first case is that there exist $j, k \in \{1, \dots, n\}$ such
that $x_j \ne x_k$. In this case, U contains some rectangular
blocks that do not overlap in the rows and columns
that they occupy. Since U is unitary, these rectangular
blocks must be square blocks. Hence, U is block-diagonal
after suitable permutation matrices are multiplied before
and after it. From the form of V, this implies that V
is block diagonal in the sense defined before the lemma.
Thus the assertion holds with W being the identity matrix
$I_n$.

The second case is that $x_1 = x_2 = \dots = x_n$. Then it
must be that $y_1 = y_2 = \dots = y_n = -x_1^*$, since otherwise
it can be deduced from (2) that there would be a column
of U that is zero, violating that U is unitary. Since U is
unitary, there is an $n \times n$ diagonal matrix E and an $n \times n$
unitary matrix W such that $U = W E W^\dagger$, then

$$V = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix} \begin{pmatrix} \gamma I_n & E \\ E^\dagger & -\gamma^* I_n \end{pmatrix} \begin{pmatrix} W^\dagger & 0 \\ 0 & W^\dagger \end{pmatrix}, \tag{3}$$

where $\gamma = x_1$. Since E, $E^\dagger$, and $I_n$ are all diagonal,
the matrix $\begin{pmatrix} \gamma I_n & E \\ E^\dagger & -\gamma^* I_n \end{pmatrix}$ is the direct sum of n $2 \times 2$
matrices up to a similarity transform by a permutation
matrix. The rows and columns of the j'th $2 \times 2$ matrix
correspond to the j'th and the (n + j)'th rows, and the
j'th and the (n + j)'th columns of the original matrix,
respectively. This completes the proof. □


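Lemma 3 Any real linear combination of the three matrices
$I_2$, $\begin{pmatrix} w & 0 \\ 0 & w^* \end{pmatrix}$ and $\begin{pmatrix} 0 & x \\ -x^* & 0 \end{pmatrix}$ is proportional to
a unitary matrix.

Proof. Let $V = a I_2 + b \begin{pmatrix} w & 0 \\ 0 & w^* \end{pmatrix} + c \begin{pmatrix} 0 & x \\ -x^* & 0 \end{pmatrix}$, where
a, b, c are real numbers. By direct computation one finds
$V V^\dagger = (|a + bw|^2 + c^2 |x|^2) I_2$, so V is proportional to a
unitary matrix. This completes the proof. □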
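The direct computation in Lemma 3 can be spot-checked numerically. The sketch below assumes arbitrary complex w and x and uses the illustrative helper names `lemma3_combination` and `is_proportional_to_unitary`; it also shows that the conclusion can fail when a coefficient is not real.

```python
import numpy as np

def lemma3_combination(a, b, c, w, x):
    """a*I_2 + b*diag(w, w*) + c*[[0, x], [-x*, 0]]."""
    A = np.array([[w, 0], [0, np.conj(w)]])
    B = np.array([[0, x], [-np.conj(x), 0]])
    return a * np.eye(2) + b * A + c * B

def is_proportional_to_unitary(V, tol=1e-10):
    """True if V V^dagger is a (possibly zero) multiple of the identity."""
    G = V @ V.conj().T
    return bool(np.allclose(G, G[0, 0] * np.eye(V.shape[0]), atol=tol))

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=3)                      # real coefficients
w, x = rng.normal(size=2) + 1j * rng.normal(size=2)
print(is_proportional_to_unitary(lemma3_combination(a, b, c, w, x)))   # True

# With a non-real coefficient the statement can fail: here V = diag(0, 2).
print(is_proportional_to_unitary(lemma3_combination(1.0, 1j, 0.0, 1j, 0.0)))  # False
```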


IV. TIGHTER UPPER BOUND FOR 
ENTANGLEMENT COST OF IMPLEMENTING 
SCHMIDT-RANK-3 UNITARIES 

On the problem of exact implementation of bipartite
nonlocal unitaries using LOCC and shared entanglement,
we use or discuss the following three known protocols.
(1) The two-way teleportation protocol, i.e., teleporting
the system of one party to the other party, performing
the unitary there, and teleporting the system back to
the original party. (2) The protocol for implementing
controlled unitaries in Sec. III of Q, which is briefly reviewed
as Protocol 4 below, and it will be called “the
basic controlled-unitary protocol.” A simple extension
of it is Protocol 5, and the latter is the basis for the
two-level controlled Protocols 6 and 7. (3) The group-type
protocol in Sec. IV of the same reference, which is briefly reviewed
as Protocol 8 below. Protocol El is used in Sec. 0 and
Protocols 7 and 8 are used in the proof of Theorem 10
(ii).

Protocol 4 (The basic controlled unitary protocol.)

The unitary to be implemented by two parties, Alice
and Bob, is

$$U = \sum_{k=0}^{N-1} P_k \otimes V_k, \tag{4}$$

where $P_k$ are mutually orthogonal projectors on $\mathcal{H}_A$, and
$V_k$ are unitary operators on $\mathcal{H}_B$. The $P_k$ may be of rank
greater than 1, meaning that the dimension of $\mathcal{H}_A$ may
be larger than N.

A figure for this protocol is Fig. 5 of the reference cited
above for this protocol. This figure
was originally for the case that $P_k$ are all rank-one, but
with suitable interpretation of the gates in the circuit (see
Sec. III C of the same reference), it works for the general case of
higher-rank $P_k$. For the protocols in this section only, the X
gate on an N-dimensional Hilbert space is defined as

$$X = \sum_{k=0}^{N-1} |(k-1) \bmod N\rangle\langle k|. \tag{5}$$

The steps of the protocol are as follows.

0. The two parties initially share the following entangled
state on ancillary systems a and b, which are with
Alice and Bob, respectively:

$$|\Phi\rangle = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} |k\rangle_a \otimes |k\rangle_b. \tag{6}$$

1. Alice performs a controlled gate $\sum_{j=0}^{N-1} P_j \otimes X^j$
on systems A and a, with A as the control. (The
$X^j$ means X to the power j.) Then Alice performs a
measurement on a in the standard basis, and sends the
result l to Bob.

2. Bob applies the gate $X^l$ to b. This is followed by a
controlled gate $\sum_{k=0}^{N-1} |k\rangle\langle k| \otimes V_k$ on b and B, with b as
the control. Then Bob does a Fourier gate on b (defined
in Sec. III), and measures b in the standard basis. The
outcome m is sent to Alice.

3. Alice carries out a $Z^{-m}$ correction on A,
where the Z is defined as $Z = \sum_{k=0}^{N-1} e^{2\pi i k/N} P_k$ (c.f. Sec.
III C of the same reference), and this definition of Z reduces to that
in Sec. III in the case that all $P_k$ are rank-one. This
completes the protocol.

The resource consumption of the protocol is $\log_2 N$
ebits and $2 \log_2 N$ c-bits.
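The steps above can be sketched as a small state-vector simulation. This is an illustration under stated assumptions rather than the construction of the cited reference: the lowering convention for X, the sign convention of the Fourier gate, and the resulting $Z^{-m}$ correction are choices made here so that the sketch is self-consistent, and the function names are hypothetical.

```python
import numpy as np

def random_unitary(d, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q

def basic_controlled_unitary_protocol(P, V, psi, rng):
    """Simulate one run; returns (output state on A x B, l, m)."""
    N, dA, dB = len(P), P[0].shape[0], V[0].shape[0]
    w = np.exp(2j * np.pi / N)
    X = np.roll(np.eye(N), -1, axis=0)         # assumed convention: X|k> = |k-1 mod N>
    Xp = [np.linalg.matrix_power(X, j) for j in range(N)]
    # Step 0: |psi>_{AB} (x) (1/sqrt N) sum_k |k>_a |k>_b; tensor indices (A, B, a, b).
    st = np.einsum('ab,cd->abcd', psi.reshape(dA, dB), np.eye(N) / np.sqrt(N))
    # Step 1: apply sum_j P_j (x) X^j on (A, a), then measure a in the standard basis.
    st = sum(np.einsum('xa,zc,abcd->xbzd', P[j], Xp[j], st) for j in range(N))
    p = np.sum(np.abs(st) ** 2, axis=(0, 1, 3))
    l = int(rng.choice(N, p=p / p.sum()))
    st = st[:, :, l, :] / np.sqrt(p[l])        # remaining indices (A, B, b)
    # Step 2: X^l on b, controlled-V_k on (b, B), Fourier gate on b, measure b.
    st = np.einsum('zd,abd->abz', Xp[l], st)
    st = np.einsum('kyb,abk->ayk', np.stack(V), st)
    F = w ** np.outer(np.arange(N), np.arange(N)) / np.sqrt(N)
    st = np.einsum('mk,abk->abm', F, st)
    p = np.sum(np.abs(st) ** 2, axis=(0, 1))
    m = int(rng.choice(N, p=p / p.sum()))
    st = st[:, :, m] / np.sqrt(p[m])           # remaining indices (A, B)
    # Step 3: correction Z^{-m} on A with Z = sum_k w^k P_k.
    Zc = sum(w ** (-m * k) * P[k] for k in range(N))
    return (Zc @ st).reshape(-1), l, m

rng = np.random.default_rng(0)
P = [np.diag([1.0, 1.0, 0.0]), np.diag([0.0, 0.0, 1.0])]   # rank-2 and rank-1 projectors
V = [random_unitary(2, rng) for _ in P]
U = sum(np.kron(Pk, Vk) for Pk, Vk in zip(P, V))
psi = rng.normal(size=6) + 1j * rng.normal(size=6)
psi /= np.linalg.norm(psi)
out, l, m = basic_controlled_unitary_protocol(P, V, psi, rng)
print(np.allclose(out, U @ psi))   # True for every outcome pair (l, m)
```

The example uses N = 2 control terms on a three-dimensional system A, so one of the projectors has rank two; the entangled resource still has Schmidt rank N = 2, i.e., one ebit.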

Protocol 5 (The extension of the basic controlled unitary
protocol to the case that some projectors in Eq. (4) are
replaced with zero operators.)

If the unitary to be implemented by Alice and Bob
is given by Eq. (4), but only some $P_k$ are projectors, and
some others are zero operators (the output is zero for any
input), then the steps of Protocol 4 can still be carried
out. Note that the controlled-$X^j$ gate in step 1 and the Z
gate in step 3 could be defined using the same expression
as before but with the $P_j$ understood as being projectors
or zero operators. The protocol still uses $\log_2 N$ ebits
and $2 \log_2 N$ c-bits. Suppose there are $N' < N$ operators
among the $\{P_k\}$ that are nonzero; then the same unitary
could be carried out with only $\log_2 N'$ ebits and $2 \log_2 N'$
c-bits using Protocol 4. Nonetheless, the less efficient
protocol turns out to be useful in Protocols 6 and 7 below.

Next, we introduce a recursive-control protocol for implementing
some bipartite unitaries with LOCC and initial
entanglement.

Protocol 6 (Protocol for implementing a bipartite unitary
with two levels of control: the special case that
the lower-level controlled unitaries are controlled from a
fixed side.)

The bipartite unitary to be implemented on $\mathcal{H}_A \otimes \mathcal{H}_B$
is of the following form:

$$U = \sum_{k=0}^{M-1} P_k \otimes S_k, \tag{7}$$

where $\mathcal{H}_A = \mathcal{H}_C \otimes \mathcal{H}_D$, and $\mathcal{H}_E = \mathcal{H}_D \otimes \mathcal{H}_B$, and $P_k$ are
orthogonal projectors on $\mathcal{H}_C$, and

$$S_k = \sum_{j=0}^{n_k-1} U_j^{(k)} \otimes Q_j^{(k)} \tag{8}$$

are controlled unitaries with local unitaries $U_j^{(k)}$ on $\mathcal{H}_D$.
The $Q_j^{(k)}$ are projectors on $\mathcal{H}_B$ and are orthogonal among
different j for the same k. Let $N := \max\{n_k : k = 0, 1, \dots, M-1\}$.
By introducing some zero operators to
the set of $Q_j^{(k)}$ and calling the new operators $\tilde{Q}_j^{(k)}$, we
may write all $S_k$ using N terms:

$$S_k = \sum_{j=0}^{N-1} U_j^{(k)} \otimes \tilde{Q}_j^{(k)}, \tag{9}$$

where $U_j^{(k)}$ are still local unitaries and some of them are
not present in Eq. (8).

The idea of the protocol can be roughly summarized
as follows. The higher level of the protocol is “k controls
$S_k$,” and the lower level is “j controls $U_j^{(k)}$.” The steps
are as follows.

0. Alice and Bob share a maximally entangled state
of Schmidt rank M on $\mathcal{H}_a \otimes \mathcal{H}_b$, and another maximally
entangled state of Schmidt rank N on $\mathcal{H}_q \otimes \mathcal{H}_r$. The
subsystems a and q are on Alice’s side, while b and r are
on Bob’s side.

1. They perform the first half of the basic controlled-unitary
protocol (Protocol 4) on $\mathcal{H}_C$ and $\mathcal{H}_a \otimes \mathcal{H}_b$, until
the $X^l$ gate in the protocol is done [the X is defined in
Eq. (5)]. Now they share a maximally entangled state
on C and b that carries the label k on both sides.

2. They perform Protocol 5 to implement $S_k$ using
their information about k stored in the entangled state
above, with the help of a maximally entangled state of
the form $\frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} |j\rangle_q \otimes |j\rangle_r$. More specifically, in the
lower-level protocol, every unitary gate is controlled by
the $|k\rangle_C$ state on Alice’s side or the $|k\rangle_b$ on Bob’s side.
If there are measurements not in the standard basis in
the lower-level protocol, we decompose each of them as a unitary
followed by a measurement in the standard basis, so that
all measurements are in the same basis and thus need not
be controlled by information about k.

3. They have effectively performed the controlled gate from
the protocol in Sec. III of Q, which is the gate in the
higher level of the current protocol. Next, the subsystem
b is measured in the Fourier basis, and a local unitary
correction, i.e., an integer power of the generalized Z
gate defined in the basic controlled-unitary protocol, is
done on C. Note that C is not being measured, since it
is a “data” system and not an ancilla.

The whole protocol uses $\log_2(MN)$ ebits and
$2\log_2(MN)$ c-bits. Note that in step 2, the measurement
outcomes in the lower-level protocol are the same
for different controlling states labelled by k. This is acceptable,
since Protocol 5 (used as the lower-level
protocol here) works under any measurement outcome
anyway.
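The resource counts quoted for Protocols 4 and 6 are simple to tabulate. The following sketch assumes that the list `n_list` holds the numbers $n_k$ of lower-level control terms, so that M is its length and N its maximum; the function names are illustrative only.

```python
import math

def basic_protocol_cost(N):
    """(ebits, c-bits) of the basic controlled-unitary protocol with N control terms."""
    ebits = math.log2(N)
    return ebits, 2 * ebits

def two_level_cost(n_list):
    """(ebits, c-bits) of the two-level protocol with a fixed lower-level control side."""
    M, N = len(n_list), max(n_list)      # all S_k are padded with zero operators to N terms
    ebits = math.log2(M * N)
    return ebits, 2 * ebits

print(basic_protocol_cost(8))        # (3.0, 6.0)
print(two_level_cost([2, 4, 3, 4]))  # M = 4, N = 4 -> (4.0, 8.0)
```

Padding every $S_k$ to N terms means the cost depends only on M and the largest $n_k$, not on the individual values.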

Protocol 7 (Protocol for implementing a bipartite unitary
with two levels of control: the general case that the
lower-level unitaries are controlled from different sides.)

In Protocol 6 the lower-level unitaries are all controlled
from the same side (and opposite to the direction of control
in the higher level, since the case of same direction
is trivial in that the unitary is then a one-level controlled
unitary). Here we consider a generalization: the lower-level
unitaries can be controlled from different sides. Formally,
the target unitary U is of the following form:

$$U = \sum_{k=0}^{M-1} P_k \otimes S_k, \tag{10}$$

where $\mathcal{H}_A = \mathcal{H}_C \otimes \mathcal{H}_D$, and $\mathcal{H}_E = \mathcal{H}_D \otimes \mathcal{H}_B$, and $P_k$ are
orthogonal projectors on $\mathcal{H}_C$. For each $S_k$, there exists
an integer $n_k \ge 1$, such that at least one of the following
two equations holds:

$$S_k = \sum_{j=0}^{n_k-1} U_j^{(k)} \otimes Q_j^{(k)} \tag{11}$$

$$\text{or} \quad S_k = \sum_{j=0}^{n_k-1} R_j^{(k)} \otimes V_j^{(k)}, \tag{12}$$

where $U_j^{(k)}$ and $V_j^{(k)}$ are local unitaries on $\mathcal{H}_D$ and $\mathcal{H}_B$,
respectively. The $Q_j^{(k)}$ are projectors on $\mathcal{H}_B$ and are
orthogonal among different j for the same k. The $R_j^{(k)}$
are projectors on $\mathcal{H}_D$ and are orthogonal among different
j for the same k. Let $N := \max\{n_k : k = 0, 1, \dots, M-1\}$.
By introducing some zero operators to the set of $Q_j^{(k)}$
and $R_j^{(k)}$, and calling the new operators $\tilde{Q}_j^{(k)}$ or $\tilde{R}_j^{(k)}$, we
have that for each $S_k$, at least one of the following two
equations holds:

$$S_k = \sum_{j=0}^{N-1} U_j^{(k)} \otimes \tilde{Q}_j^{(k)} \tag{13}$$

$$\text{or} \quad S_k = \sum_{j=0}^{N-1} \tilde{R}_j^{(k)} \otimes V_j^{(k)}, \tag{14}$$

where $U_j^{(k)}$ and $V_j^{(k)}$ are local unitaries on $\mathcal{H}_D$ and $\mathcal{H}_B$,
respectively, and some of them are not present in Eqs. (11) and (12).

The steps of the protocol are modified from ProtocollS] 
as follows: The first two steps are the same as the Steps 
0 and 1 of Protocol [H after which both sides have a copy 
of the computational-basis information of the higher-level 
controlling state. And since the form of the overall uni¬ 
tary is known, each party knows whether he or she is to 
act as the controlling party in the lower-level protocol, 
depending on the higher-level controlling state. So in the 
modified Step 2 of the protocol, each party does what is 
supposed to be done locally in the lower-level controlled- 
unitary protocol, with each unitary gate being controlled 
by the local higher-level controlling state labeled by k, 
but the measurements are all in the standard basis and 
thus need not be controlled (if there are measurements 
not in the standard basis, we decompose it as a unitary 
followed by a measurement in the standard basis). There 
are two stages of classical communication (in opposite 
directions) in Step 2, and for each such communication 
stage, the party that is supposed to send classical mes¬ 
sages does exactly the same operations as before, but 
the opposite party measures in the computational basis 
on an extra ancilla initially in the \ j) state, 

and sends the outcome to the other party. The choice of 
measuring a useful system or a dummy ancilla introduced 
above is determined by the higher level controlling state 
labeled by k. However, for actual implementation, the 
actual measurement should be on a fixed system. This 
can be resolved by a controlled-swap gate controlled by 
k, which conditionally swaps the system to be measured 
into a fixed system before doing the measurement. The 
final step is similar to Step 3 of Protocol HI 

The whole protocol requires the same amount of en¬ 
tanglement as in Protocol HI but generally requires more 
classical communication, since the correct and dummy 
messages are sent in both directions simultaneously in 
the two stages of classical communication in Step 2, so we 
allow twice as much classical communication in the lower- 
level protocol. Thus the overall protocol uses log 2 (MiV) 
ebits and 21 og 2 (MiV^) c-bits. A dummy message is the 
measurement outcome of a system which was originally 
(before the controlled-swap gate mentioned in the previous
paragraph) an ancilla in a fixed initial state. Note
that the dummy classical message is only dummy for
some of the higher-level controlling states labeled by k,
but is the correct message for some others. Such message, 
even if “correct”, does not carry any information about 
the input state for the overall unitary, by the design of 
the basic controlled-unitary protocol. The rationale be¬ 
hind the above technique is as follows: The choice of 
which lower-level unitary is being implemented should be 
indistinguishable from an outside observer, since the in¬ 
formation about the higher-level controlling state should 
not be leaked to the outside observer, which is necessary
for implementing a unitary operation. The reason
is in Theorem 1 of [5|, which says that implementing a 
unitary successfully is equivalent to that no information 
about the input state of the unitary is leaked to an “envi¬ 
ronment” system (the tensor product of the environment 
system and the output system of the unitary is the entire 
output system of the protocol). 

Protocol 8 (Protocol for implementing a bipartite uni¬ 
tary given its group-type expansion.) 

This protocol is illustrated in Fig. 8 in @ (except for 
changes in symbols in the description below), and it implements
bipartite unitaries of the form

$$U = \sum_{f \in G} V_A(f) \otimes W_B(f), \qquad (15)$$

where the $V_A(f)$ are unitaries acting on $\mathcal{H}_A$, and they
form a projective unitary representation of a finite group
$G$, and the $W_B(f)$ are arbitrary operators acting on $\mathcal{H}_B$,
subject to the requirement that $U$ is unitary. This protocol uses a
maximally entangled resource state of Schmidt rank $|G|$ (the
order of $G$). Thus the entanglement cost is $\log_2 |G|$ ebits.
The classical communication cost is $2\log_2 |G|$ c-bits. For
any unitary $U$, we may expand it in the form (15) by letting
$G$ be the generalized Pauli group (ignoring overall
phases) $\{X^j Z^k : j,k \in [0, d_A - 1]\}$, which is of order $d_A^2$,
since the $d_A^2$ generalized Pauli matrices form a basis for
the space of $d_A \times d_A$ matrices.
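The basis property used in the last sentence can be checked numerically. The following is a minimal sketch, assuming the common convention $X|m\rangle = |m+1 \bmod d\rangle$ and $Z|m\rangle = \omega^m|m\rangle$ with $\omega = e^{2\pi i/d}$ (the convention and all function names are illustrative choices, not taken from the protocol description). It expands an arbitrary $d \times d$ matrix in the operators $X^jZ^k$ using the Hilbert-Schmidt inner product and rebuilds it from the coefficients.

```python
import cmath

def pauli(d, j, k):
    """X^j Z^k in dimension d, with X|m> = |m+1 mod d> and Z|m> = w^m |m>."""
    w = cmath.exp(2 * cmath.pi * 1j / d)
    # (X^j Z^k)|m> = w^(k*m) |(m + j) mod d>, so each column has one nonzero entry.
    return [[w ** (k * col) if row == (col + j) % d else 0j
             for col in range(d)] for row in range(d)]

def hs_inner(a, b):
    """Hilbert-Schmidt inner product Tr(a^dagger b)."""
    return sum(x.conjugate() * y
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def expand(m):
    """Coefficients c[(j, k)] such that m = sum_{j,k} c[(j, k)] X^j Z^k."""
    d = len(m)
    return {(j, k): hs_inner(pauli(d, j, k), m) / d
            for j in range(d) for k in range(d)}

def reconstruct(coeffs, d):
    """Rebuild the matrix from its generalized-Pauli coefficients."""
    out = [[0j] * d for _ in range(d)]
    for (j, k), c in coeffs.items():
        p = pauli(d, j, k)
        for r in range(d):
            for s in range(d):
                out[r][s] += c * p[r][s]
    return out

def max_abs_diff(a, b):
    """Largest entrywise deviation between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

For a bipartite $U$, the same inner product taken on the A factor only (a partial trace over A) would give the operators $W_B(f)$; that step is omitted here to keep the sketch short.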

We abbreviate the steps of the protocol here. For our
purposes, a good property of the protocol to be utilized
for the proof of Theorem 10 is that when $U$ is the A-direct
sum of some unitaries, it is often the case that there is
a relatively small group $G$ (by “small” we mean smaller
than $d_A^2$) such that $U$ can be expanded in the form (15).
The reason is as follows. Each component
in the A-direct sum form of $U$ is also expandable using
the form (15); thus, its size divided by $d_B$ is the dimension
of a (projective) unitary representation of the group
$G$, where the representation is obtained by restricting
$V_A(f)$ to the relevant subspace of $\mathcal{H}_A$, for all $f \in G$. Denote
the dimension of such a projective representation as
$n_i$, $i = 1, \dots, K$, where $K$ is the total number of components
in the A-direct sum form of $U$. Assume that there
is a group $G$ that has inequivalent irreducible projective
unitary representations of sizes $n_i$, $i = 1, \dots, K'$, where
$K' \ge K$, and the $n_i$ with $i > K$ (in the case $K' > K$)


are arbitrary positive integers (this is, of course, a big assumption
and does not hold for most bipartite unitaries,
but note that we may regard several blocks in an A-direct
sum form of $U$ as one block to increase the chance that
such a group $G$ exists, which is a technique used in the
proof of Theorem 10), then we may do the following steps:
Arbitrarily choose a factor system (see the definition in 
Q) from the set of factor systems of G that admit in¬ 
equivalent irreducible projective unitary representations 
of sizes $n_i$, $i = 1, \dots, K'$ (the existence of such a factor
system is guaranteed by the assumption above). Then 
choose a projective unitary representation of G that con¬ 
tains all inequivalent irreducible projective unitary rep¬ 
resentations belonging to this factor system. This would 
be a linearly independent set of matrices according to [^, 
Theorem 4], and they are of a simultaneous block diago¬ 
nal form. We then remove some diagonal blocks from all 
these matrices so that the remaining blocks are of sizes 
$n_i$, $i = 1, \dots, K$. Then the resulting matrices would be
generally linearly dependent, and from the construction, 
the resulting set forms a (possibly overcomplete) basis 
for the space of matrices with the same block structure. 
Thus, this set of unitary matrices can be used to expand 
the bipartite unitary $U$ in the form of (15).

In our application in the proof of Theorem [10] in this 
paper, we choose the type of group G directly and figure 
out its suitable size. A different problem has been dis¬ 
cussed in [13: which is trying to find the smallest group 
G when the matrix of U is known. However, there is some 
similarity: Our reason for choosing the dihedral groups 
as the type of group G in the proof of Theorem 10 is
based on the H-direct-sum form of U that we proved. 
The algorithm for choosing the group G in na also is 
based on finding the A-direct-sum form of U (which cor¬ 
responds to the block diagonal structure of the operators 
on Ha that are used to expand U). 

The protocols with two levels of control can be gener¬ 
alized to protocols with multiple levels of control. Some 
other generalizations are possible (but not used in this 
paper): The lower-level operators in the target uni¬ 
tary of the form © need not be a controlled unitary, but 
could be unitaries with group-type expansion in Proto¬ 
col m and thus the inner level of the protocol becomes 
Protocol ID 

For studying Theorem 10 we introduce the preliminary
lemma below. We note that the simplest type of
Schmidt-rank-three bipartite unitaries, which are controlled
unitaries with three terms, are generally not included
in Lemma 9 due to the restrictions on the coefficients
$c_{j1}, c_{j2}, c_{j3}$ and the matrices $T_2$ and $T_3$ below.

Lemma 9 Suppose there are three linearly independent
$d \times d$ unitary matrices $I_d$, $T_2$ and $T_3$, where $I_d$ is the identity
matrix, $T_2$ is diagonal, $T_3$ is not diagonal,
and $T_2$, $T_3$ are not simultaneously diagonalizable under
a unitary similarity transform; and $K$ distinct triplets
$(c_{j1}, c_{j2}, c_{j3})$, where $j = 1, \dots, K$, the $c_{j1}$ are real and nonnegative,
and $c_{j2}$ and $c_{j3}$ are nonzero complex numbers, such
that

$$U = \sum_{j=1}^{K} |j\rangle\langle j| \otimes (c_{j1} I_d + c_{j2} T_2 + c_{j3} T_3) \qquad (16)$$

is a bipartite unitary of Schmidt rank 3 on a $K \times d$ space
$\mathcal{H}_A \otimes \mathcal{H}_B$.

Then up to local unitaries, there is a decomposition of
$U$ with the following direct sum structure on $\mathcal{H}_B$:
$U = \oplus_k U_k$, $I_d = \oplus_k I_{d_k}$, $T_2 = \oplus_k T_2^{(k)}$,
$T_3 = \oplus_k T_3^{(k)}$, satisfying that each

$$U_k = \sum_{j=1}^{K} |j\rangle\langle j| \otimes (c_{j1} I_{d_k} + c_{j2} T_2^{(k)} + c_{j3} T_3^{(k)}) \qquad (17)$$

is a unitary on the $K \times d_k$ subspace $\mathcal{H}_A \otimes \mathcal{H}_{B_k}$, $d = \sum_k d_k$,
and that $T_3^{(1)}$ is diagonal, and for each $k > 1$,
$T_2^{(k)} = \mathrm{diag}(e^{i\alpha_k}, -e^{-i\alpha_k})$, $\alpha_k \in \mathbb{R}$; $T_3^{(k)}$ is a non-scalar
$2 \times 2$ unitary whose non-diagonal entries are equal and
positive.

The proof of this lemma is given in Appendix A.
Lemma 9 leads to the following result, where assertion
(i) is a structure theorem for Schmidt-rank-3 bipartite
unitaries. Note that the assumption of the result implies
$d_A \ge 3$ and $d_B \ge 2$.

Theorem 10 Assume that $U$ is a Schmidt-rank-3 bipartite
unitary controlled from the A side. Then the following
assertions hold.

(i) Either $U$ is the A-direct sum of at most three unitaries
of Schmidt rank at most 2, or $U$ is locally equivalent to
a B-direct sum of controlled unitaries of Schmidt rank at
most 3. Each of the controlled unitaries is on a $d_A \times 1$
or $d_A \times 2$ space controlled in the computational basis of
$\mathcal{H}_A$.

(ii) $U$ can be implemented by local operations and

$$\log_2 \min\{d_A, d_B^2, 4\lfloor d_B/2 \rfloor + 2\} \qquad (18)$$

ebits of entanglement and

$$2\log_2 \min\{d_A, \max\{12, 4\lfloor d_B/2 \rfloor + 2\}\} \qquad (19)$$

c-bits.


The proof of this theorem is given in Appendix B.
Given that the A side is the control, the result in @ gives 
an entanglement cost upper bound of $\log_2 \min\{d_A, d_B^2\}$
ebits. This old upper bound is never less than the
new upper bound in (18). When $d_A$, $d_B$ are both large
and $d_A$ is about $d_B^2$, the new upper bound in (18) is about
$\log_2(2d_B) = 1 + \log_2 d_B$ ebits, which is about half of the
old upper bound, which is about $\log_2 d_B^2 = 2\log_2 d_B$ ebits.
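The comparison can be made concrete with a few numbers. The sketch below assumes the old bound is read as $\log_2\min\{d_A, d_B^2\}$ and the new bound as (18) with arguments $d_A$, $d_B^2$ and $4\lfloor d_B/2\rfloor + 2$; the function names are illustrative only.

```python
from math import log2

def old_bound(d_a, d_b):
    """Earlier entanglement-cost bound in ebits, read as log2 min{d_A, d_B^2}."""
    return log2(min(d_a, d_b ** 2))

def new_bound(d_a, d_b):
    """Bound (18) in ebits, read as log2 min{d_A, d_B^2, 4*floor(d_B/2) + 2}."""
    return log2(min(d_a, d_b ** 2, 4 * (d_b // 2) + 2))

def saving(d_a, d_b):
    """Number of ebits saved by the new bound relative to the old one."""
    return old_bound(d_a, d_b) - new_bound(d_a, d_b)
```

With $d_B = 16$ and $d_A = d_B^2 = 256$, the old bound is 8 ebits and the new one is $\log_2 34$, a little over 5 ebits, in line with the "about half" estimate for large dimensions.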

We show two classes of examples. The first shows that 
for some U, the entanglement cost can be much less than 
the upper bound in Theorem 10(ii).


Example 11 Consider a Schmidt-rank-three unitary $U$
of the form (ED. Let $\mathcal{H}_B$ be of dimension $2n$, and $T_1 = I_B$,
$T_2 = \oplus_{j=1}^{n} \sigma_z$, $T_3 = \oplus_{j=1}^{n} [\cos(t_j)\sigma_x + \sin(t_j)\sigma_y]$, where
$t_j$ ($1 \le j \le n$) are some different real numbers. Then
$(T_2)^2 = (T_3)^2 = I_B$, and $T_2T_3 = -T_3T_2$. Actually, by
conjugation using a local diagonal unitary on $\mathcal{H}_B$, we
can transform $T_3$ into $\oplus_{j=1}^{n} \sigma_x$, while keeping $T_1$ and $T_2$
unchanged. The other $T_j$ with $j > 3$ are given by $T_j =
\cos\theta_j\, T_1 + i\sin\theta_j\cos\phi_j\, T_2 + i\sin\theta_j\sin\phi_j\, T_3$, where $\theta_j$ and
$\phi_j$ are real. The B space of $U$ is spanned by a projective
representation of an Abelian group of order 4 (the Klein
four-group), hence Protocol 8 implements $U$ using 2 ebits
of entanglement and LOCC. This is much less than the
upper bound in Theorem 10(ii) when $d_A$ and $d_B$ are large.
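The algebraic claims in this example can be checked on a single $2 \times 2$ block. The sketch below is an illustration under the reading $T_3$-block $= \cos(t)\sigma_x + \sin(t)\sigma_y$ and $T_j$-block $= \cos\theta\, I + i\sin\theta\cos\phi\,\sigma_z + i\sin\theta\sin\phi\,T_3$; the helper names and the specific diagonal unitary $\mathrm{diag}(1, e^{-it})$ are choices made here, not taken from the source.

```python
import cmath
import math

I2 = [[1, 0], [0, 1]]
SIGMA_X = [[0, 1], [1, 0]]
SIGMA_Z = [[1, 0], [0, -1]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def close(a, b, tol=1e-12):
    """Entrywise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

def t3_block(t):
    """cos(t) sigma_x + sin(t) sigma_y, written out entrywise."""
    return [[0, cmath.exp(-1j * t)], [cmath.exp(1j * t), 0]]

def tj_block(theta, phi, t):
    """cos(theta) I + i sin(theta) cos(phi) sigma_z + i sin(theta) sin(phi) T3."""
    t3 = t3_block(t)
    a = math.cos(theta)
    b = 1j * math.sin(theta) * math.cos(phi)
    c = 1j * math.sin(theta) * math.sin(phi)
    return [[a * I2[i][j] + b * SIGMA_Z[i][j] + c * t3[i][j] for j in range(2)]
            for i in range(2)]

def rotate_to_sigma_x(t):
    """Conjugate the T3 block by diag(1, e^{-it}); sigma_z is left unchanged."""
    d = [[1, 0], [0, cmath.exp(-1j * t)]]
    return matmul(matmul(d, t3_block(t)), dagger(d))
```

Since the blocks of $T_2$ and $T_3$ square to the identity and anticommute, $\{I, T_2, T_3, T_2T_3\}$ closes under multiplication up to phases, which is the order-4 projective representation the example refers to.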

The second class of examples is still for unitary $U$ of the
form in Lemma 9, but with essentially different blocks in
different subspaces of $\mathcal{H}_B$. It suggests that there might
not be an easy improvement to the upper bound in Theorem 10(ii)
for general Schmidt-rank-three bipartite unitaries.

Example 12 We use the notations in the proof of
Lemma 9 but assume that the unitary is without the
diagonal part, i.e., the subspace $\mathcal{H}_{B_1}$ is a null space. Assume
the diagonal elements of the $2 \times 2$ matrices $T_2^{(k)}$
and $T_3^{(k)}$ are $s_k\sqrt{1-b^2} + bi$ and $s_k t b\sqrt{\frac{1-b}{1+b}} + tbi$, respectively,
where $b \in (0, 1]$ is a variable dependent on $k$, and
$t$ is a positive constant less than 1, e.g. $t = 1/2$, and the
sign factor $s_k$ for the real part is either $1$ or $-1$. Suppose
the diagonal elements with the positive $s_k$ appear
first in each $T_2^{(k)}$ and $T_3^{(k)}$, and denote such elements as
$T_{2k}$ and $D_{3k}$, respectively. Then $\mathrm{Im}(T_{2k})$, $\mathrm{Im}(D_{3k})$, and
$\mathrm{Re}(T_{2k}D_{3k}^*)$ are $b$, $tb$, and $tb$, respectively, which is useful
for checking the result below. Since $|D_{3k}| < 1$, the two
off-diagonal elements of $T_3^{(k)}$ are chosen to be equal real
numbers such that $T_3^{(k)}$ is unitary. Let the $(c_{j1}, c_{j2}, c_{j3})$
satisfy that $c_{j1} = (ty - 1)/\sqrt{(1 + y^2)(ty - 1)^2 + t^2y^2}$, and
$c_{j2} = ic_{j1}ty/(ty - 1)$, $c_{j3} = ic_{j1}y$, for $j = 1, \dots, M$, where
$M$ is an arbitrary positive integer, and $y = y_j > 1/t$ is a
real positive number independent of $k$ but dependent on
$j$. Note that $b = b_k$ is independent of $j$. The diagonal
part of Eq. (A2) can be written as

$$(c_{j1})^2 + (\tilde{c}_{j2})^2 + (\tilde{c}_{j3})^2 - 2c_{j1}\tilde{c}_{j2}\,\mathrm{Im}(T_{2k}) - 2c_{j1}\tilde{c}_{j3}\,\mathrm{Im}(D_{3k}) + 2\tilde{c}_{j2}\tilde{c}_{j3}\,\mathrm{Re}(T_{2k}D_{3k}^*) = 1 \qquad (20)$$

for $k = 1, 2, \dots, d$. Here we have used that $c_{j1}$ is real,
and $c_{j2}$ and $c_{j3}$ are pure imaginary, and that $T_2$, $T_3$ are
unitary, and we denote $\tilde{c}_{j2} := \mathrm{Im}(c_{j2})$, $\tilde{c}_{j3} := \mathrm{Im}(c_{j3})$. It
is easily verified that there are an infinite number of solutions
of $y = y_j$ and $b = b_k$ for (20) when $t$ is fixed, and
by choosing some sufficient but finite number of them
to be used in the matrix $U$, the $U$ would have Schmidt
rank three. The $U$ is unitary because each $2 \times 2$ block in
each controlled operator on the B side is unitary, and the
latter follows from Lemma [3] and our choice of the $T_{2k}$
and $D_{3k}$, and that $c_{j1}$ is real, and $c_{j2}$ and $c_{j3}$ are pure
imaginary. The statement about the number of solutions
above implies that the dimensions $d_A$ and $d_B$ can be arbitrarily
large, and we do not know of any simple protocol
that implements this class of unitaries with a constant
number of ebits and LOCC. This suggests there might
not be an easy improvement to the upper bound in Theorem 10(ii).
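A numerical check of condition (20) is sketched below. It rests on explicit assumptions about the parameterization, since the printed formulas are hard to read: the diagonal entries are taken as $T_{2k} = \sqrt{1-b^2} + bi$ and $D_{3k} = tb\sqrt{(1-b)/(1+b)} + tbi$, the second diagonal entries as $-T_{2k}^*$ and $-D_{3k}^*$, and the coefficients as $c_{j1} = (ty-1)/\sqrt{(1+y^2)(ty-1)^2 + t^2y^2}$, $\tilde{c}_{j2} = c_{j1}ty/(ty-1)$, $\tilde{c}_{j3} = c_{j1}y$. All function names are illustrative.

```python
from math import sqrt

def coefficients(t, y):
    """Return (c1, c2_tilde, c3_tilde), where c_{j2} = i*c2_tilde and c_{j3} = i*c3_tilde."""
    norm = sqrt((1 + y * y) * (t * y - 1) ** 2 + (t * y) ** 2)
    c1 = (t * y - 1) / norm
    return c1, c1 * t * y / (t * y - 1), c1 * y

def diagonal_entries(t, b):
    """First diagonal entries (sign factor +1) of the assumed 2x2 blocks of T2 and T3."""
    t2k = complex(sqrt(1 - b * b), b)
    d3k = complex(t * b * sqrt((1 - b) / (1 + b)), t * b)
    return t2k, d3k

def lhs_eq20(t, y, b):
    """Left-hand side of condition (20); it should equal 1."""
    c1, c2, c3 = coefficients(t, y)
    t2k, d3k = diagonal_entries(t, b)
    return (c1 ** 2 + c2 ** 2 + c3 ** 2
            - 2 * c1 * c2 * t2k.imag
            - 2 * c1 * c3 * d3k.imag
            + 2 * c2 * c3 * (t2k * d3k.conjugate()).real)

def block(t, y, b):
    """The 2x2 block c1*I + i*c2_tilde*T2 + i*c3_tilde*T3 for given (y, b)."""
    c1, c2, c3 = coefficients(t, y)
    t2k, d3k = diagonal_entries(t, b)
    off = sqrt(1 - abs(d3k) ** 2)  # equal real off-diagonal entries of T3
    t2 = [[t2k, 0], [0, -t2k.conjugate()]]
    t3 = [[d3k, off], [off, -d3k.conjugate()]]
    return [[c1 * (i == j) + 1j * c2 * t2[i][j] + 1j * c3 * t3[i][j]
             for j in range(2)] for i in range(2)]

def unitarity_defect(m):
    """Largest entry of |m m^dagger - I| for a 2x2 matrix m."""
    worst = 0.0
    for i in range(2):
        for j in range(2):
            entry = sum(m[i][k] * m[j][k].conjugate() for k in range(2))
            worst = max(worst, abs(entry - (1 if i == j else 0)))
    return worst
```

Under these assumptions the cross terms in (20) cancel identically, so every pair $(y, b)$ with $y > 1/t$ and $b \in (0, 1]$ is a solution, which is consistent with the statement that infinitely many solutions exist for fixed $t$.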


IV. ENTANGLEMENT COST AND
ENTANGLING POWER OF BIPARTITE
PERMUTATION UNITARIES

This section is motivated by the following question. 
What is the entanglement cost for Schmidt-rank-three 
bipartite permutation unitaries? The result in Theorem 
22 of 0 gives an upper bound of 24 ebits, with the help 
of a one-qubit ancilla on one side. Other motivations to 
study the permutation unitaries are in the first paragraph 
of Sec. IV C and also in 0. We shall first develop some
preliminary results about bipartite (complex) permuta¬ 
tion unitaries of general Schmidt rank, and then derive 
the improved upper bounds for the entanglement cost for 
bipartite permutation unitaries of small Schmidt rank in 
Sec. IV A. The case of general Schmidt rank is studied
in Sec. IV B. We give an example in Sec. IV C and study
the entangling power of bipartite permutation unitaries
of Schmidt rank up to three in Sec. IV D.

Lemma 13 Let U be a complex bipartite permutation 
matrix of Schmidt rank r. Then the following assertions 
hold. 

(i) The nonzero blocks in any big row or big column of U 
are linearly independent. The number of them is at least 
1 and at most r. 

(ii) Suppose a big row of U contains r linearly indepen¬ 
dent blocks. Then up to local complex permutation ma¬ 
trices the first r blocks in the big row are orthogonal pro¬ 
jectors, whose sum is the identity matrix. 

A similar statement holds when all “row” are replaced
with “column”.

(iii) Under the assumption in (ii), up to local complex
permutation matrices $U$ is a complex $r$-term controlled-permutation
unitary from the B side. The projectors in
the terms are exactly the projectors in (ii). Such a unitary
can be implemented using $\log_2 r$ ebits and LOCC.

(iv) If $U$ is a real permutation unitary, then (ii) and (iii)
hold with all occurrences of the word “complex” removed.

(v) Suppose a big row of $U$ contains $r - 1$ linearly independent
blocks. Then up to local complex permutation
matrices the first $r - 1$ blocks in the big row are orthogonal
projectors, whose sum is the identity matrix.

A similar statement holds when all “row” are replaced
with “column”.

(vi) Under the assumption in (v), assume that the projectors
and their orders are respectively $P_j$ and $s_j$ for
$j = 1, \cdots, r - 1$. Up to local complex permutation matrices,
we have

U=(^iQ(8) P) 0A ^{Qj 0 Pj^ ©B 

( 21 ) 

where $n \in \{0\} \cup [2, r - 1]$, and $P$, $Q$ and $Q_j$ are all complex
permutation matrices on their respective subspaces. $P$ is
of size $(\sum_{j=1}^{n} s_j) \times (\sum_{j=1}^{n} s_j)$. The matrices
$Q$ and $Q_j$ ($\forall j \le n$) are orthogonal in both the input and
output spaces. Furthermore, $U_j$ is a complex permutation
matrix of Schmidt rank at most two on the bipartite
Hilbert space $\mathcal{H}_A \times \mathrm{span}\{|s_1 + \cdots + s_n + 1\rangle, \cdots, |d_B\rangle\}$.
The B space of $U_j$ contains $P_j$.

If $n \in [2, r - 2]$, then $U$ can be implemented using
$\max\{2 + \log_2 n, 2 + \log_2(r - n - 1)\}$ ebits and LOCC. If
$n = 0$, then $U$ can be implemented using $2 + \log_2(r - 1)$
ebits and LOCC. If $n = r - 1$, then $U$ can be implemented
using $1 + \log_2(r - 1)$ ebits and LOCC.

(vii) In (vi), if $U$ is a real permutation unitary, and $n = 0$,
then under local permutations, either $U$ can be written
in the $n = r - 1$ case of the form of (21), or $U$ is a
controlled-permutation unitary controlled from the B side
with at most $2(r - 1)$ terms, thus $U$ can be implemented
using $1 + \log_2(r - 1)$ ebits of entanglement.
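The three cost cases stated in assertion (vi) can be collected in one small helper. This is only a transcription of the stated formulas into code, with a hypothetical function name, and it rejects values of $n$ outside $\{0\} \cup [2, r-1]$.

```python
from math import log2

def lemma13_vi_ebits(r, n):
    """Entanglement cost (ebits) stated in Lemma 13(vi) for Schmidt rank r and parameter n."""
    if r < 2:
        raise ValueError("the statement assumes r - 1 >= 1 blocks")
    if n == 0:
        return 2 + log2(r - 1)
    if n == r - 1 and n >= 2:
        return 1 + log2(r - 1)
    if 2 <= n <= r - 2:
        return max(2 + log2(n), 2 + log2(r - n - 1))
    raise ValueError("n must lie in {0} or [2, r - 1]")
```

For $r = 3$ the only allowed values are $n = 0$ and $n = 2$, giving 3 ebits and 2 ebits respectively.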

The proof of this lemma is given in Appendix C. The
partial transpose has been used to study the separability 
problem in entanglement theory [1^ . Recently it has 

been used to study the ranks of marginals of multipartite 
quantum states [lOj, in terms of the following conjectured 
inequality 

$$\mathrm{rank}\Big(\sum_{j=1}^{k} A_j \otimes B_j\Big) \le k \cdot \mathrm{rank}\Big(\sum_{j=1}^{k} A_j \otimes B_j^T\Big), \qquad (22)$$

where $A_j$ (resp. $B_j$) are matrices of the same size and
T denotes the transpose. In previous works we have pre¬ 
sented a few bipartite unitaries satisfying the inequality 
[ 3 , 0 . One can verify that the partial transpose of the 
complex permutation unitaries in (ii) and (21) are still
unitary matrices. When considered as one of the bracket
expressions in the lhs or rhs of (22), they both satisfy
(22). They provide further evidence supporting the conjecture.
We do not know whether all bipartite permutation
matrices or complex permutation matrices satisfy
(22).

Next we describe some simple properties about the 
$d_B \times d_B$ blocks in bipartite permutation matrices. Let
$m(r)$ denote the maximum possible number of distinct
diagonal blocks in a Schmidt-rank-$r$ bipartite controlled-permutation
unitary. Let $m'(r)$ denote the maximum
possible number of distinct permutation matrices in the 
B-space of a Schmidt-rank-r bipartite permutation uni¬ 
tary. Let n(r) denote the maximum possible number of 
distinct nonzero partial permutation matrices in the B- 
space of a Schmidt-rank-r bipartite permutation unitary. 
Using these definitions we state the following lemma.

Lemma 14 (i) $m(r)$ is equal to the maximum number
of distinct permutation matrices in the linear span of $r$
arbitrary permutation matrices of the same size.

(ii) $m(r) = 2^{r-1}$.

(iii) The entanglement cost of any Schmidt-rank-$r$
controlled-permutation unitary is not more than $r - 1$
ebits.

(iv) $m'(r)$ is not greater than the maximum number of
distinct permutation matrices in the linear span of $r$ arbitrary
partial permutation matrices of the same size.

(v) $m'(r) = 2^{r-1}$.

(vi) $n(r) = 2^r - 1$, and the maximum in the definition
of $n(r)$ is achieved only when the bipartite permutation
unitary is equivalent to a controlled unitary from the B
side under local permutation unitaries.
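The value $2^{r-1}$ in assertion (ii) can be illustrated with a simple family of matrices. The construction below is one that attains the count, and it is not claimed to be the one used in Appendix D: take $r - 1$ diagonal $2 \times 2$ blocks, each equal to either the identity or the swap. This gives $2^{r-1}$ distinct permutation matrices whose linear span has dimension exactly $r$. The rank routine is a plain Gaussian elimination over exact fractions, and all names are illustrative.

```python
from fractions import Fraction
from itertools import product

def block_diag_perm(bits):
    """Flattened matrix of the direct sum over i of (swap if bits[i] else identity)."""
    n = 2 * len(bits)
    m = [[0] * n for _ in range(n)]
    for i, bit in enumerate(bits):
        a, b = 2 * i, 2 * i + 1
        if bit:
            m[a][b] = m[b][a] = 1
        else:
            m[a][a] = m[b][b] = 1
    return tuple(x for row in m for x in row)

def construction(r):
    """All 2^(r-1) permutation matrices built from r - 1 identity-or-swap blocks."""
    return [block_diag_perm(bits) for bits in product((0, 1), repeat=r - 1)]

def rank(vectors):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    rk = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk
```

Using these matrices as the diagonal blocks of a controlled-permutation unitary gives Schmidt rank $r$ with $2^{r-1}$ distinct blocks, so a controlled-unitary protocol on that form would need $r - 1$ ebits, matching assertion (iii).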

The proof of this lemma is given in Appendix D. Evidently
the $m'(r)$ and $n(r)$ would be unaffected if we replace
B by A in their definition.

A. Entanglement cost of bipartite permutation 
unitaries of Schmidt rank two or three 

We have studied the properties of the complex bipar¬ 
tite permutation unitaries in terms of the Schmidt rank in 
Lemmas 13 and 14. In this subsection we study the bipartite
permutation unitaries of Schmidt rank two or three.
They are locally equivalent to controlled unitaries [^-Q. 
So they can be implemented using the basic controlled-unitary
protocol by directly using the controlled form;
however, this might require more than the minimal amount
of entanglement. Lemma 15(i) below, together
with Lemma 26, implies that the entanglement cost by
directly using the controlled form is minimal for the case
of Schmidt rank two.

Lemma 15 (i) Any Schmidt-rank-two bipartite permu¬ 
tation unitary is equivalent to a two-term controlled- 
permutation unitary under local permutation unitaries. 
(ii) Any Schmidt-rank-two bipartite complex permutation 
unitary is equivalent to a two-term controlled-complex- 
permutation unitary under local complex permutation 
unitaries. 

Proof. Let us prove (ii) first. Denote the complex permutation
unitary as $U$. Its standard matrix form, also denoted by $U$,
is a $d_Ad_B \times d_Ad_B$ matrix. If there is a big row or column
of $U$ containing two nonzero blocks, then the assertion
follows from Lemma 13(iii). It suffices to consider the
case that there is exactly one nonzero block in any big
row or column of $U$. Up to local permutation matrices on
$\mathcal{H}_A$ we may assume that $U$ is a block-diagonal complex
permutation matrix, and the first two diagonal blocks
$D_1$, $D_2$ are linearly independent. Up to a local complex
permutation matrix on $\mathcal{H}_B$, we may assume $D_1 = I_B$. If
all diagonal blocks of $U$ are proportional to $D_1$ or $D_2$,
then the assertion follows. If there is a diagonal block
which is not proportional to any one of $D_1$, $D_2$, then $D_2$
has to be diagonal, and if $D_2$ has only two distinct diagonal
entries, then $U$ is equivalent to a controlled complex
permutation unitary from the B side with two terms, up
to local permutation unitaries. Thus we only need to consider
the remaining case, i.e., that $D_2$ is diagonal and has
at least three distinct diagonal entries. However in this
case D 2 cannot be unitary by Lemma |TJ This completes 
the proof of (ii). 

The proof for (i) is similar. If there is a big row or 
column of U containing two nonzero blocks, the assertion 
follows from Lemma 13(iv). In the remaining case, the
result follows from Lemma 14(ii). □

Now we investigate the structure and entanglement 
cost for complex permutation unitaries of Schmidt rank 
three. In particular, the real counterpart is completely 
characterized in (i). 

Lemma 16 (i) Up to local permutation unitaries, any 
Schmidt-rank-three bipartite permutation unitary is ei¬ 
ther equivalent to a three-term or four-term controlled- 
permutation unitary, or equivalent to the direct sum of a 
product permutation unitary and a two-term controlled- 
permutation unitary. Therefore such unitary can be im¬ 
plemented using 2 ebits and 4 c-bits. 

(ii) Any Schmidt-rank-three bipartite complex permuta¬ 
tion unitary that is not equivalent to a diagonal unitary 
under local permutation unitaries can be implemented us¬ 
ing 3 ebits and LOCC. 

(iii) Any diagonal Schmidt-rank-three bipartite complex
permutation unitary, whose diagonal blocks contain the 
identity matrix and a diagonal matrix of exactly two dis¬ 
tinct diagonal elements, can be implemented using 2 ebits 
and LOCC. 

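The proof of this lemma is given in Appendix E. An
example for “the direct sum of a product permutation
unitary and a two-term controlled-permutation unitary”
is given by the following unitary on a $3 \times 2$ system:

$$U = \big[\,|1\rangle\langle 1| \otimes (|1\rangle\langle 2| + |2\rangle\langle 1|)\,\big] \oplus_A \big[\,(|2\rangle\langle 2| + |3\rangle\langle 3|) \otimes |1\rangle\langle 1| + (|2\rangle\langle 3| + |3\rangle\langle 2|) \otimes |2\rangle\langle 2|\,\big]. \qquad (23)$$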
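The example in (23) is small enough to verify directly. The sketch below lists the six rank-one terms of $U$ (reading the second bracket with tensor products, as the surrounding text indicates), checks that $U$ permutes the six computational basis states, and counts the distinct nonzero $2 \times 2$ blocks. Because distinct blocks here have disjoint supports, their number equals the dimension of the B space, i.e. the Schmidt rank. The names are illustrative.

```python
from itertools import product

A_LEVELS = (1, 2, 3)
B_LEVELS = (1, 2)

# Each term (a_out, a_in, b_out, b_in) stands for |a_out><a_in| (x) |b_out><b_in|.
TERMS = [
    (1, 1, 1, 2), (1, 1, 2, 1),  # |1><1| (x) (|1><2| + |2><1|)
    (2, 2, 1, 1), (3, 3, 1, 1),  # (|2><2| + |3><3|) (x) |1><1|
    (2, 3, 2, 2), (3, 2, 2, 2),  # (|2><3| + |3><2|) (x) |2><2|
]

def action(terms):
    """Map every input basis state (a, b) to its image (a_out, b_out)."""
    image = {}
    for a_out, a_in, b_out, b_in in terms:
        if (a_in, b_in) in image:
            raise ValueError("two nonzero entries in one column")
        image[(a_in, b_in)] = (a_out, b_out)
    return image

def is_permutation(image):
    """True if the map is a bijection on the full set of basis states."""
    states = set(product(A_LEVELS, B_LEVELS))
    return set(image) == states and set(image.values()) == states

def nonzero_blocks(terms):
    """Nonzero B-blocks keyed by (a_out, a_in), each as a set of (b_out, b_in) positions."""
    blocks = {}
    for a_out, a_in, b_out, b_in in terms:
        blocks.setdefault((a_out, a_in), set()).add((b_out, b_in))
    return {key: frozenset(val) for key, val in blocks.items()}

def schmidt_rank_disjoint_supports(terms):
    """Schmidt rank, valid when distinct blocks have pairwise disjoint supports."""
    distinct = list(set(nonzero_blocks(terms).values()))
    for i in range(len(distinct)):
        for j in range(i + 1, len(distinct)):
            if distinct[i] & distinct[j]:
                raise ValueError("supports overlap; use a general rank computation")
    return len(distinct)
```

The three distinct blocks are the swap on B, $|1\rangle\langle 1|$ and $|2\rangle\langle 2|$, so the Schmidt rank is three, as the lemma requires.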

B. Entanglement cost of bipartite permutation 
unitaries of general Schmidt rank 

The following Protocol 18 implements a bipartite permutation
unitary $U$ of arbitrary Schmidt rank $r$. The
computational basis for each system starts with $|0\rangle$. The
entanglement and classical communication cost of the
protocol in terms of $r$ is analyzed in Theorem 22. Before
introducing the protocol, we define the so-called effective
input and output dimensions for $U$. An example unitary
illustrating these definitions is in Example 25 in Sec. IV C.

Definition 17 (i). The effective input dimension of A is
the number of types of input states of A. A type of input
states of A (or “an input type of A”) is a subspace of $\mathcal{H}_A$
spanned by computational basis states, so that any two
big columns of $U$ corresponding to two computational
basis states in the subspace have the same collection of
blocks in them, ignoring the positions and the relative
order of the nonzero blocks in the big column.

(ii). The effective output dimension of A relative to
an input computational basis state of $\mathcal{H}_A$ is the number
of nonzero blocks in the big column of $U$ corresponding
to the input computational basis state of $\mathcal{H}_A$. The
labels for each effective dimension for a given input
computational state of $\mathcal{H}_A$ are determined by the order in
which the nonzero blocks appear in the big column. The
output computational basis state of $\mathcal{H}_A$ corresponding to
the big row with a nonzero block in the given big column
is called an output type of A relative to the input of A,
abbreviated as “a relative output type of A”.

(iii). The effective output dimension of B is the number
of output types of B, where an output type of B is a
subspace of $\mathcal{H}_B$ spanned by computational basis states,
so that each computational basis state in such subspace
has the same combination of being in or not in the output
space of the partial permutation operators in the B space
of $U$. It turns out that for this definition of the output
type of B, it suffices to consider a linearly independent set
of $r$ partial permutation operators in the B space of $U$,
which form a basis for the B space of $U$, and we call such
revised definition the simplified definition. Such a basis
of $r$ partial permutation operators does exist, and its elements can
be selected from the $d_B \times d_B$ blocks in the matrix $U$.
Any other partial permutation operator in the B space
of $U$ is a linear combination of these $r$ basis operators.
Suppose the simplified definition is inequivalent to the
original definition. Then there are two computational
basis states in the output space $\mathcal{H}_B$ so that they are
simultaneously in or not in the output space of any of
the $r$ basis operators, while one and only one of them
is in the output space of another partial permutation
operator $Q_s$ in the B space of $U$. The $Q_s$ is a linear
combination of the $r$ basis operators, each of which has
row sums being equal between the two said output types,
hence the row sums of $Q_s$ are equal between the two said
output types, and we have arrived at a contradiction.
Therefore, the simplified definition is equivalent to the
original definition.
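Parts (i) and (ii) of the definition translate directly into a computation on the permutation that $U$ induces on basis states. The sketch below uses, purely for illustration, the $3 \times 2$ unitary of Eq. (23) rather than the example referenced in the text; the dictionary `IMAGE` and the function names are choices made here. A big column is represented by its nonzero blocks, and two inputs of A have the same type when they carry the same collection of blocks regardless of position.

```python
from collections import defaultdict

# Action of the 3 x 2 permutation unitary of Eq. (23) on basis states (a, b).
IMAGE = {
    (1, 1): (1, 2), (1, 2): (1, 1),
    (2, 1): (2, 1), (2, 2): (3, 2),
    (3, 1): (3, 1), (3, 2): (2, 2),
}

def big_column(image, a_in):
    """Nonzero blocks in the big column of input a_in, keyed by the output label of A."""
    blocks = defaultdict(set)
    for (a, b), (a_out, b_out) in image.items():
        if a == a_in:
            blocks[a_out].add((b_out, b))
    return {a_out: frozenset(entries) for a_out, entries in blocks.items()}

def input_types_of_a(image):
    """Group input basis states of A by the collection of blocks in their big column."""
    types = defaultdict(list)
    for a in sorted({a for a, _ in image}):
        collection = frozenset(big_column(image, a).values())
        types[collection].append(a)
    return sorted(types.values())

def effective_output_dimension(image, a_in):
    """Number of nonzero blocks in the big column of input a_in."""
    return len(big_column(image, a_in))
```

For this unitary the inputs $|2\rangle$ and $|3\rangle$ of A share one type and $|1\rangle$ forms another, so the effective input dimension of A is 2, while the relative effective output dimensions are 1, 2 and 2.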

Protocol 18 (A protocol that implements a general bipartite
permutation unitary $U$.) The circuit diagram for
the protocol is shown in Fig. 1. The steps of the protocol
are as follows.

1. Alice prepares an ancilla $a$ in the state $|0\rangle$, and performs
a controlled-$X^j$ gate on $A$ and $a$ (with projectors
on $\mathcal{H}_A$ of rank possibly greater than one) so that the system
$a$ stores in its $Z$ basis the information about the type
of input state on system A, which is defined in Def. 17(i),
and is abbreviated as “the input type of A”. The integer
$j \in \{0, 1, \dots, d - 1\}$ labels the type of the input state of
A, where $d$ is the dimension of system $a$. The $X$ is the
cyclic shift gate $\sum_{j=0}^{d-1} |(j + 1) \bmod d\rangle\langle j|$ (note it was the

minus sign in and Protocol | 6 ] instead of the plus sign). 

2. Alice sends the $Z$-information about $a$ to Bob’s
side, so that Alice has a copy $a$ storing the $Z$-information
about $a$, and Bob has a copy $e'$. This requires a prior
shared maximally entangled pair of $d$-dimensional qudits
$ee'$ in the state $\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1} |jj\rangle_{ee'}$ and involves a controlled
cyclic-shift gate on $ae$ and a measurement of $e$ in the
standard basis on Alice’s side, with the outcome sent to
Bob using a classical channel, and a cyclic-shift gate on
$e'$ on Bob’s side according to the measurement outcome.

3. Bob has an ancilla system $f_0$ initialized in $|0\rangle$.
He performs a controlled permutation unitary $W$ on $e'$
(which now stores the input type of A), $f_0$ and $B$, with
$e'$ being the control, to prepare the output type of A
on the output system $f$ relative to the input $f_0$ [defined
in Def. 17(ii)], and at the same time prepare the output
state of B (under the action of $U$) on the system $B$. Note
that if the input $f_0$ and the corresponding output $f$ for
the gate $W$ are removed, the $W$ would not be unitary in
general.

4. Bob measures $e'$ in the Fourier basis and a phase
correction (an integer power of $Z = \sum_{j=0}^{d-1} e^{2\pi i j/d} |j\rangle\langle j|$)
is done on $a$ by Alice according to the measurement
outcome sent classically. Bob teleports $f$ to the A side,
where it is denoted as $f'$.

5. Alice performs a controlled permutation unitary
gate $V$ on three systems $A$, $a$, and $f'$, with the joint
system $af'$ being the control, to get the output of A.

6. The remaining task is to erase the state on $a$ and
$f'$. The $a$ stores the input type of A, and the $f'$ stores
the relative output type of A, and both are determined
jointly by the output of A together with the output type
of B. Hence a preparation of a system $h$ containing the
output type of B [defined in Def. 17(iii)] is needed, and
$h$ is teleported to the A side (and denoted $h'$), for
Alice to erase $a$ and $f'$ to $|0\rangle_a|0\rangle_{f'}$ by a controlled permutation
unitary gate $T$ acting on $Aaf'h'$, with the joint
system $Ah'$ being the control. Finally $h'$ is measured
in the Fourier basis and the outcome is sent to Bob classically,
and a phase correction is done on system B. The
phase correction gate is denoted as an integer power of $Z$
to indicate that it is a diagonal operator with eigenvalues
being the $d$-th roots of unity but with some degeneracies,
where $d$ is the number of the output types of B. This
completes the protocol, with the output of $U$ in systems
A and B. □
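Step 2 of the protocol above (copying the $Z$-information of $a$ onto Bob's system $e'$) can be simulated on a small state vector. The sketch below fixes one sign convention as an assumption, since the text leaves the signs to the earlier protocols: Alice's controlled shift maps $|j\rangle_a|k\rangle_e$ to $|j\rangle_a|k - j\rangle_e$, and Bob's correction for outcome $m$ maps $|k\rangle_{e'}$ to $|k - m\rangle_{e'}$, all modulo $d$. Function and variable names are illustrative.

```python
from math import sqrt

def copy_z_information(alpha):
    """Simulate Step 2 for an ancilla state sum_j alpha[j] |j>_a.

    Returns {m: (probability, post_state)}, where post_state maps (a, e') to the
    normalized amplitude after Bob's correction for measurement outcome m.
    """
    d = len(alpha)
    # Initial state: ancilla a times the maximally entangled pair on e e'.
    state = {(j, k, k): alpha[j] / sqrt(d) for j in range(d) for k in range(d)}
    # Assumed convention: controlled shift |j>_a |k>_e -> |j>_a |k - j mod d>_e.
    state = {(a, (e - a) % d, ep): amp for (a, e, ep), amp in state.items()}
    results = {}
    for m in range(d):
        # Alice measures e in the standard basis and obtains outcome m.
        branch = {(a, ep): amp for (a, e, ep), amp in state.items() if e == m}
        prob = sum(abs(amp) ** 2 for amp in branch.values())
        # Assumed convention: Bob's correction |k>_e' -> |k - m mod d>_e'.
        post = {(a, (ep - m) % d): amp / sqrt(prob) for (a, ep), amp in branch.items()}
        results[m] = (prob, post)
    return results
```

In this simulation every outcome occurs with probability $1/d$ and leaves the state $\sum_j \alpha_j |j\rangle_a|j\rangle_{e'}$, so the classical message carries no information about the input, in line with the requirement discussed earlier for implementing a unitary.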

The following lemma gives an upper bound of the maximum
number of types of the input state on system A
defined in Def. 17(i). A matrix occupies a column if and
only if it has a nonzero element in that column. Suppose
$S$ is a set of nonzero $d \times d$ partial permutation matrices.
A subset $S' \subseteq S$ is called a covering subset if and only if
any two matrices in $S'$ do not occupy the same column,
and any column is occupied by some matrix in $S'$. A basis
of $S$ is a maximal linearly independent set of matrices
in $S$.

FIG. 1: The circuit diagram for Protocol 18. It implements any bipartite permutation unitary $U$ on the system AB,
using LOCC and prior shared entanglement. The latter is explicitly shown using wavy lines or implied in the
teleportation steps shown in solid vertical lines with arrows. The initial entangled state on the system $ee'$ is
$\frac{1}{\sqrt{d}}\sum_k |k\rangle_e|k\rangle_{e'}$, where $d$ is the dimension of both $e$ and $e'$. The $F$ is the Fourier transform gate. The $W$, $V$, $T$ are
controlled permutation gates defined in the protocol. The top input line to $W$ is in the same state as that of system
$a$ after the first controlled-$X^j$ gate, which stores in its computational basis the input type of A. The $f$ at the second
output line of $W$ is the output type of A relative to the input type of A. The $h'$ is the output type of B. The $W$ is
controlled by the first input line (i.e., the system $e'$), and the $V$ is controlled by the second and third lines, and the
$T$ is controlled by the first and the fourth lines.


The Bell number $B_r$ is the number of different ways to
partition a set of $r$ distinguishable elements, regardless of
the order of the parts and the order of elements within
each part. By simple calculation, $B_1 = 1$, $B_2 = 2$,
$B_3 = 5$, $B_4 = 15$, $B_5 = 52$, and it is known that $B_r <
[0.792r/\ln(r + 1)]^r$ for any integer $r \ge 1$ [ 2 ^.
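The listed values and the quoted upper bound are easy to reproduce. The sketch below computes Bell numbers with the Bell triangle recurrence and evaluates the bound, reading the logarithm as the natural logarithm; the function names are illustrative.

```python
from math import log

def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_n_max] using the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

def bell_upper_bound(r):
    """The quoted bound [0.792 r / ln(r + 1)]^r."""
    return (0.792 * r / log(r + 1)) ** r
```

The bound is close to tight for small $r$: for $r = 4$ it evaluates to roughly 15.01, against $B_4 = 15$.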

Lemma 19 Suppose $S$ is a set of nonzero $d \times d$ partial
permutation matrices which include exactly $r$ linearly independent
matrices, and each column is occupied by some
matrix in $S$. The number of covering subsets of $S$ is not
greater than $B_{r+1}$.

Proof. The assertion apparently holds when $r = 1$. In
the following we assume $r \ge 2$. From Lemma 14(vi), the
size of $S$ is at most $2^r - 1$. A covering subset of $S$ can
contain at most $r$ elements, since elements of a covering
subset must be linearly independent.

Let us fix a basis of $r$ linearly independent matrices in
$S$. From the proof of Lemma 14(vi), there are $r$ positions
(matrix elements) of $d \times d$ matrices that determine
a partial permutation matrix in the space spanned by
the $r$ basis matrices. Let us call these $r$ matrix elements
“key elements”. Some of the key elements may be in
the same column. Any two matrices in the same covering
subset of $S$ occupy disjoint sets of columns,
hence they cannot both contain 1’s at the position of the
same key element, nor can they contain a “1” respectively
at one of two different key elements in the same
column. Hence any matrix in a covering subset of $S$ is
characterized by a set of key elements among the given $r$
key elements, and a covering subset of $S$ is characterized
by a partitioning of the key elements, but possibly with
some key element(s) not belonging to any matrix in the
covering subset. In the latter case we arrange the “extra”
key element(s) into a part, and mark this part with
an auxiliary element, i.e., let the auxiliary element and
the extra key element(s) be put into the same part in the
partition of $r + 1$ elements. In the case that no extra
key element exists, the auxiliary element is a part of the
partition by itself. Therefore, the total number of covering
subsets of $S$ is at most the number of partitions of $r + 1$
elements, which is $B_{r+1}$. This completes the proof. □

Now we introduce a new definition of the number of 
input types on system A (the definition for system B is 
similar), which will be used in Protocol 21 below. If the
sum of all blocks in a big column of U is equal to the 
corresponding sum for another big column, then these 
two big columns are regarded as of the same type in the 
loose sense. The reason for this new definition is that any 
input computational basis state on B is mapped to the 
same output state on B under the maps represented by 
the two big columns which satisfy that the sum of blocks 
in them are equal. 

Lemma 20 The number of distinct types of big columns
of a bipartite permutation matrix of Schmidt rank r in
the loose sense is at most $2^{r-1}$. This bound is tight.

Proof. Denote the matrix as U, and denote the maximum
value of the quantity in the assertion as f(r), which
is a function of r only. The sum of all blocks in a big
column of U is in the B space of U, and is a matrix with
elements being 0 or 1 and with the sum of elements in each
column equal to 1. By an argument similar to that in the
proof of Lemma [TW iib there are at most $2^{r-1}$ such
matrices in the B space of U. By definition, two big columns
of different types in the loose sense differ in the
sum of their blocks. Hence $f(r) \le 2^{r-1}$. An example
of U that reaches the maximum value of $2^{r-1}$ is in the
proof of Lemma [TTr iil. □
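To make the loose-sense definition concrete, here is a minimal numerical sketch that counts the distinct block sums over the big columns of a given permutation matrix and compares the count with $2^{r-1}$. The helper names `loose_types` and `schmidt_rank`, and the choice of CNOT and SWAP as test cases, are ours; the basis ordering (A index major, B index minor) is an assumption matching `numpy.kron`.

```python
import numpy as np


def loose_types(U, dA, dB):
    """Count distinct sums of blocks over the big columns of U."""
    blocks = U.reshape(dA, dB, dA, dB)   # blocks[i, :, j, :] is block (i, j)
    sums = blocks.sum(axis=0)            # sum over big rows i; shape (dB, dA, dB)
    seen = {sums[:, j, :].astype(int).tobytes() for j in range(dA)}
    return len(seen)


def schmidt_rank(U, dA, dB):
    """Operator Schmidt rank of U across the A|B cut (rank of the realigned matrix)."""
    R = U.reshape(dA, dB, dA, dB).transpose(0, 2, 1, 3).reshape(dA * dA, dB * dB)
    return int(np.linalg.matrix_rank(R))


P0, P1 = np.diag([1, 0]), np.diag([0, 1])
X = np.array([[0, 1], [1, 0]])
cnot = np.kron(P0, np.eye(2, dtype=int)) + np.kron(P1, X)   # controlled from A
swap = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])

for name, U in (("CNOT", cnot), ("SWAP", swap)):
    r = schmidt_rank(U, 2, 2)
    print(name, "Schmidt rank", r, "loose types", loose_types(U, 2, 2),
          "bound", 2 ** (r - 1))
```

In this sketch CNOT has Schmidt rank 2 and two loose-sense types, so it meets the bound for r = 2, while SWAP has Schmidt rank 4 and only two types.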

Protocol 21 (Another protocol that implements a general
bipartite permutation unitary U.) The circuit diagram
for the protocol is shown in Fig. 2. The steps of
the protocol are as follows.

1. Alice prepares an ancilla a in the state $|0\rangle$, and
performs a controlled gate on A and a (with projectors
on $\mathcal{H}_A$ of rank possibly greater than one) so that
the system a stores in its Z basis the information about
the type of input state of A in the loose sense, which is
defined before Lemma 20. She teleports a to Bob’s side
using prior shared entanglement and LOCC. Similarly,
Bob prepares an ancilla b storing the information about
the type of input state of B in the loose sense, and he
teleports b to Alice’s side.

2. Alice performs a controlled-permutation unitary W
on A', A and the teleported b, with A and b as the control;
A' is initialized in $|0\rangle$ before this gate.
The controlled operator acting on A' in the gate W is a
permutation unitary that only swaps the $|0\rangle$ state with
the output state determined by the state on the control
registers, and keeps the other Z-basis states of A' unchanged
(those states are not the actual input state anyway). After
W, the Z-information about the output of A under
the action of U is stored in the Z basis of A'. Similarly,
Bob performs his counterpart of W, after which B' contains
the Z-information about the output of B under U.

3. Alice teleports b back to Bob’s side, and Bob teleports
a back to Alice’s side. Each party performs the
inverse of the controlled gate in step 1 to erase a and
b to $|0\rangle$.

4. This step is similar to step 1, except that $U^\dagger$ instead
of U is considered here, and A' and B' are regarded
as the input for the unitary $U^\dagger$. An ancillary system a'
is initialized in $|0\rangle$, and after the controlled gate on A'
and a', the a' contains the type of state of A' in the loose
sense, and is teleported to the other side. Similarly, the
b' containing the type of state of B' in the loose sense is
teleported to Alice’s side.

5. This step is similar to step 2. The controlled permutation
gate T and its counterpart on Bob’s side are defined
similarly to the two W gates in step 2, but with $U^\dagger$
instead of U, and with A' and B' taking the role of the
input for $U^\dagger$. Because of the form of these gates and the
states of A and B just prior to this step, A and B are
erased to $|0\rangle$.

6. This step is similar to step 3. Alice teleports b' back
to Bob’s side, and Bob teleports a' back to Alice’s side.
Each party performs the inverse of the controlled gate in
step 4 to erase a' and b' to $|0\rangle$. This completes the
protocol, with the output of U in systems A' and B'.

In the protocol above, we need to erase the A, B (which 
become ancillary systems in the end) and other ancillas 
to some fixed state, because no information about the 
input should be leaked to ancillas in the end; otherwise 
the protocol does not implement a unitary operator (c.f. 
3, Sec. II C). The above protocol computes the correct 
output states on A'B' for the input computational states 
on AB without introducing extra phases, and by linear¬ 
ity, it implements the unitary U on all input quantum 
states. 
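The compute-then-erase logic of the protocol can be followed on computational basis states with a purely classical simulation. The sketch below is a simplified illustration under our own assumptions: it ignores teleportation and the compression into loose-sense types, lets each side see the full control information, and uses the inverse map for the erasure stage. The names `swap_with_zero` and `run_protocol` are hypothetical.

```python
import itertools
import random


def swap_with_zero(target, value):
    """Permutation on one register: exchange 0 and `value`, fix everything else."""
    if target == 0:
        return value
    if target == value:
        return 0
    return target


def run_protocol(perm, x, y):
    """Track registers (A, B, A', B') for basis input (x, y).

    `perm` maps an input pair (x, y) to the output pair (x', y').
    """
    inverse = {out: inp for inp, out in perm.items()}
    a, b, a_out, b_out = x, y, 0, 0
    # Steps 1-3: each side writes its part of the output into a fresh register.
    xo, yo = perm[(a, b)]
    a_out = swap_with_zero(a_out, xo)
    b_out = swap_with_zero(b_out, yo)
    # Steps 4-6: the inverse map determines which input value each side erases.
    xi, yi = inverse[(a_out, b_out)]
    a = swap_with_zero(a, xi)
    b = swap_with_zero(b, yi)
    return a, b, a_out, b_out


# Hypothetical example: a random permutation of the basis states of a 3 x 4 system.
d_a, d_b = 3, 4
pairs = list(itertools.product(range(d_a), range(d_b)))
images = pairs[:]
random.Random(0).shuffle(images)
perm = dict(zip(pairs, images))

for (x, y) in pairs:
    assert run_protocol(perm, x, y) == (0, 0) + perm[(x, y)]
print("A and B end in 0 and A'B' hold the output for every basis input")
```

Because the final values of A and B are independent of the input, no information about the input remains in them, which is the property the preceding paragraph requires.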

Theorem 22 Any bipartite permutation unitary of
Schmidt rank r can be implemented using local operations
with the help of $\min\{\log_2(B_{r+1}) + r + \log_2 r,\ 8r - 8\}$ ebits
of entanglement and twice as many c-bits.

The proof of this theorem is given in Appendix F.
This significantly improves over the result in Theorem
22 of [9], which states that such a unitary can be implemented
using LOCC with $3 \times 2^r$ ebits. Since
$B_r < [0.792\,r/\log_e(r+1)]^r$ for any integer r > 1 [^, the first
term in the result of Theorem 22 scales as $O(r \log r)$,
while the second term 8r − 8 scales as $O(r)$, but the first
term is smaller for many integer values of r, at least
including all r < 1100 (note that for very small r, the exact
value of $B_{r+1}$ is used in the calculation rather than the
asymptotic bound above). Also, note that the duration
of classical communication in Protocol 18 (not
including the time for entanglement preparation) could
be as low as 3L/c (since the two teleportations from the
B side to the A side can be done simultaneously with
the sending of the classical message m; there are also
two classical messages l and n sent from A to B before
and after this step), where L is the distance between the
two parties, and c is the speed of light. The communication
time required by Protocol 21 is also 3L/c, since the
middle two among the four stages of teleportations can
be combined into one. Combining the considerations of
entanglement cost and communication time, Protocol 18
has a definite advantage over Protocol 21 for small r. In
the case r = 4, an improved bound is provided by the
following corollary:

Corollary 23 Any bipartite permutation unitary of 
Schmidt rank four can be implemented using LOCC with 
the help of not more than 10.71 ebits of entanglement. 

Proof. Denote the bipartite permutation unitary as
U. If there is a big column of U containing four nonzero
blocks, from Lemma M ( iv), U is a controlled permutation
unitary with four terms, hence the entanglement
cost is at most $\log_2 4 = 2$ ebits. If there is a big column
of U containing three nonzero blocks, from Lemma [T2|
(vi) and (vii), the entanglement needed is not more than
$\max\{2 + \log_2 2,\ 1 + \log_2 3\} = 3$ ebits, under a protocol
that may have up to three levels of control depending
on U. For the remaining cases, there is a formula
$\log_2(B_{r+1} \cdot r \cdot 2^r)$ in the proof of Theorem 22, and the
r term is now replaced with 2 because any big column
of U contains at most 2 nonzero blocks. This gives
$\log_2(52 \times 2 \times 16) < 10.71$ ebits. Taking all cases into
consideration, the entanglement cost of U is not greater
than 10.71 ebits. □

FIG. 2: The circuit diagram for Protocol 21. It implements any bipartite permutation unitary U of Schmidt rank r
with input in AB and output in A'B', using LOCC and at most 8r − 8 ebits of entanglement, where r is any
positive integer. A solid inclined line with arrows represents teleportation. The gates W and T on Alice's side and
their counterparts on Bob's side are controlled permutation gates defined in the protocol. Alice's W is controlled by
the systems A and (teleported) b, and Alice's T is controlled by the systems A' and (teleported) b'. The a stores the
input type of A in the loose sense, while a' is for the output type of A in the loose sense. Hence the dimensions of a
and a' may be unequal. Similar statements hold for the B side.
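The two terms in Theorem 22 and the arithmetic in Corollary 23 are easy to evaluate numerically. The following is a small sketch under our own naming (`bell`, `ebit_bounds`); it simply evaluates the stated formulas with exact Bell numbers and makes no claim about where the two terms cross.

```python
import math


def bell(n):
    """Bell number B_n via the Bell triangle (exact integer arithmetic)."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def ebit_bounds(r):
    """Return the two terms inside the min{...} of Theorem 22 for Schmidt rank r."""
    first = math.log2(bell(r + 1)) + r + math.log2(r)
    second = 8 * r - 8
    return first, second


for r in (2, 3, 4, 8, 16):
    first, second = ebit_bounds(r)
    print(f"r = {r:2d}: log2(B_(r+1)) + r + log2(r) = {first:7.3f}, 8r - 8 = {second}")

# Corollary 23: the factor r is replaced by 2 when r = 4.
corollary_bound = math.log2(bell(5) * 2 * 2 ** 4)
print(f"Corollary 23 bound: {corollary_bound:.4f} ebits")
```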

In Theorem 24 (i) below, the two methods for implementing
a bipartite permutation unitary in the proof of
Theorem 22 are adapted to classical bipartite reversible
circuits after simple changes. The implementation
in Theorem 24 (ii) below has some ancillas with
final values not equal to initial values. Generally, in a
classical computation on one party that uses reversible
gates only, if it is required to restore the ancillas to their
initial values in the end, we may copy the computation
result by CNOT gates (the CNOT is a reversible gate)
to some blank register, and the other ancillas can be restored
to their initial values by running the inverse of the
original reversible circuit. Such a process is discussed in
[lll|, and a significantly modified method is used in Protocol
21 (modification is needed because the initial inputs
are still present after the first part of the protocol,
and they should be removed in the end for implementing
a quantum unitary operation), helping us obtain
the 8r − 8 term in the result about entanglement cost in
Theorem 22. Theorem 24 (i) below can be directly
adapted for quantum circuits that do not use entanglement
but use nonlocal CNOT gates, as stated in (iii).
In the following, a bipartite classical reversible map is a
reversible map from n + m bits to n + m bits, where the
n bits are on party A, and the m bits are on party B.
The matrix of such a map is a permutation matrix. The
Schmidt rank of a bipartite classical reversible map is defined
as the Schmidt rank of the corresponding quantum
map, which is a bipartite permutation unitary and has
the same matrix as the bipartite classical reversible map.

Theorem 24 (i) Any bipartite classical reversible map
of Schmidt rank r can be implemented using classical
local reversible gates and
$\min\{2\lceil\log_2(B_{r+1})\rceil + 2r + 2\lceil\log_2 r\rceil,\ 8r - 8\}$
classical nonlocal CNOT gates, if ancillas
start with some known value and are required to be
restored to the same value at the end.

(ii) Any bipartite classical reversible map of Schmidt rank
r can be implemented using classical local reversible gates
and 2r − 2 classical nonlocal CNOT gates, with ancillas
starting with some known value but without any requirement
about their final value.

(iii) The assertion (i) also holds for quantum circuits,
when the terms “classical reversible map”, “classical local
reversible gates”, “classical nonlocal CNOT” and “value”
are replaced by “permutation unitary”, “local permutation
unitaries”, “nonlocal CNOT” and “computational
basis state”, respectively.
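As a quick illustration of how the count in part (i) compares with the count in part (ii), the sketch below evaluates both expressions for small r. The function names are our own, and the ceilings of base-2 logarithms are computed with integer arithmetic to avoid rounding issues.

```python
def bell(n):
    """Bell number B_n via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def ceil_log2(n):
    """Exact ceil(log2(n)) for a positive integer n."""
    return (n - 1).bit_length()


def cnot_bound_restored(r):
    """Theorem 24 (i): ancillas must be restored to their initial value."""
    first = 2 * ceil_log2(bell(r + 1)) + 2 * r + 2 * ceil_log2(r)
    return min(first, 8 * r - 8)


def cnot_bound_unrestored(r):
    """Theorem 24 (ii): no requirement on the final value of the ancillas."""
    return 2 * r - 2


for r in range(2, 9):
    print(r, cnot_bound_restored(r), cnot_bound_unrestored(r))
```

For r = 6, for example, the first term in part (i) evaluates to 38, slightly below 8r − 8 = 40, while part (ii) gives 10.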

The proof of this theorem is given in Appendix G. Note
that Theorem 24 (ii) does not have a corresponding statement
for quantum permutation unitaries, because to
implement a unitary operator, the ancillas at the end of
the protocol should not contain information about the
input, as mentioned in the proof of Theorem 22. Also
note that we do not know whether Theorem 24 holds if
all ancillas are required to start in some unknown state.
This kind of consideration also appears in (2lj| , which
uses the term “borrowed bit” to describe an ancillary bit
whose initial value is not known and is returned to the
initial value at the end of the computation. On the other
hand, in Theorem 22 there is no specific requirement on
the ancillas, so ancillas initialized in fixed quantum states
are allowed and are actually used in the protocols in the
proof.


C. Examples 


The simplest examples of Schmidt-rank-four permutation
gates are the two-qubit SWAP and DCNOT (double-CNOT)
gates. In the following we show a less trivial
example of a Schmidt-rank-four permutation gate
that can be implemented using Protocol 18. Example 25
below is about a unitary which is the product of a
few transpositions on the input system, where a transposition
is a swap of two states among the computational
basis states. Such gates are of interest for quantum
computation: In quantum algorithms involving queries such
as Grover’s algorithm [2^, [2^, the oracle often acts
nontrivially on only one or a few computational basis
states, and is either a complex permutation gate or a
permutation gate, and in the former case it can often be
implemented by a permutation gate with the help of ancilla
qubit(s), which is illustrated in [2^ in the case of Grover’s
algorithm. We consider the problem of minimizing the
entanglement cost across some bipartite cut of the whole
input system. This is not only useful when the two parties
are in separate locations, but is also useful
for a local quantum computer where some gates between
certain sets of qubits may be harder to implement than
other gates due to the design of the layout of the qubits,
etc. In the latter case the CNOT-gate cost may be a more
relevant measure than entanglement cost, but our protocols
can easily be modified to use CNOT gates across a
bipartite division of the whole system instead of using
entanglement and classical communication (both cases are
with the help of local gates), usually with linear overhead.
An example of such overhead is in the proof of
Theorem 24, which is for classical reversible circuits but
can be immediately translated into a result for quantum
circuits involving permutation gates.
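The properties claimed for the unitary in Example 25 below can be checked numerically. The following sketch builds U from its 6 × 6 blocks and verifies that it is a symmetric permutation matrix of Schmidt rank four with three types of big columns. The variable names and the way column types are counted (by the set of blocks appearing in a big column) are our own illustrative choices.

```python
import math

import numpy as np

T1 = np.diag([1, 1, 1, 0, 0, 0])
T2 = np.zeros((6, 6), dtype=int)
T2[0, 3] = T2[1, 4] = T2[2, 5] = 1      # maps basis states 4,5,6 to 1,2,3
T3 = T2.T
T4 = np.zeros((6, 6), dtype=int)
T4[3, 4] = T4[4, 3] = T4[5, 5] = 1      # swaps states 4 and 5, fixes state 6
Z = np.zeros((6, 6), dtype=int)

U = np.block([
    [T1, T3, Z, Z, Z],
    [T2, Z, T3, Z, Z],
    [Z, T2, Z, T3, Z],
    [Z, Z, T2, Z, T3],
    [Z, Z, Z, T2, T4],
])

dA, dB = 5, 6
is_permutation = bool((U.sum(axis=0) == 1).all() and (U.sum(axis=1) == 1).all())
is_symmetric = bool((U == U.T).all())

# Operator Schmidt rank across the A|B cut.
R = U.reshape(dA, dB, dA, dB).transpose(0, 2, 1, 3).reshape(dA * dA, dB * dB)
rank = int(np.linalg.matrix_rank(R))

# Input types of A: big columns containing the same set of blocks.
blocks = U.reshape(dA, dB, dA, dB)
column_types = {
    frozenset(blocks[i, :, j, :].tobytes() for i in range(dA)) for j in range(dA)
}

print(is_permutation, is_symmetric, rank, len(column_types))
print("Protocol cost:", math.log2(3 * 2 * 2), "two-way teleportation:", 2 * math.log2(5))
```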


Example 25 Suppose U is a Schmidt-rank-four permutation
unitary on a 5 × 6 dimensional system. The matrix
form of U expressed using blocks is

\[
U = \begin{pmatrix}
T_1 & T_3 & 0 & 0 & 0 \\
T_2 & 0 & T_3 & 0 & 0 \\
0 & T_2 & 0 & T_3 & 0 \\
0 & 0 & T_2 & 0 & T_3 \\
0 & 0 & 0 & T_2 & T_4
\end{pmatrix},
\tag{24}
\]

where $T_1 = \mathrm{diag}(1, 1, 1, 0, 0, 0)$, and

\[
T_2 = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\tag{25}
\]

and $T_3$ is the transpose of $T_2$, and

\[
T_4 = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
\tag{26}
\]

The B space of U is spanned by $T_1, T_2, T_3, T_4$, hence U is
of Schmidt rank four. The U is a symmetric matrix, so it
is easy to express the action of U as the swapping of some
pairs of computational basis states. The U can be implemented
using Protocol 18. The effective input dimension
of system A is three, because the second, third and fourth
big columns of U all have the same two nonzero blocks
$T_2$ and $T_3$ in them, so the corresponding three computational
basis states of $\mathcal{H}_A$ are regarded as the same type
of input state of A. The effective output dimension of
A relative to any of the input computational basis states
of $\mathcal{H}_A$ is two, because there are only two nonzero blocks
in each big column of U. The effective output dimension
of B is two, because the first three computational
basis states of $\mathcal{H}_B$ appear in the output of $T_1$ and $T_2$,
but not in $T_3$ or $T_4$, so these three states are counted as
one type of output state of B, and the same holds for
the last three computational basis states of $\mathcal{H}_B$. Hence
Protocol 18 requires $\log_2(3 \times 2 \times 2) < 3.59$ ebits
for this U. In contrast, implementing U using two-way
teleportation (see the beginning of Sec. V) would need
$2\log_2 5 > 4.64$ ebits. This shows that Protocol 18 can
sometimes be more efficient than two-way teleportation.


D. Entangling power of bipartite permutation
unitaries of small Schmidt rank

To know how tight our upper bounds for the entanglement
cost for bipartite permutation unitaries of small
Schmidt rank are, it is helpful to know the entangling
power of those unitaries, since the entangling power (the
quantity $K_E$ in [14]) gives a lower bound for the entanglement
cost under LOCC. Formally for a bipartite unitary
U acting on systems AB, we have

\[
K_E(U) = \max_{|\alpha\rangle, |\beta\rangle} E\big(U(|\alpha\rangle \otimes |\beta\rangle)\big).
\tag{27}
\]

Here $|\alpha\rangle$ and $|\beta\rangle$ are pure states on systems $AR_A$ and
$BR_B$ respectively, $R_A$ and $R_B$ are local ancillas, and
E is the von Neumann entropy of the reduced density
matrix on one of the two systems $AR_A$ and $BR_B$. From
the definition of $K_E$, we have $K_E(U) \le \log_2 r$ ebits for
any U of Schmidt rank r.


Lemma 26 The entangling power and entanglement
cost of any Schmidt-rank-two bipartite permutation unitary
are both 1 ebit.

Proof. From Lemma [T5] (i), up to local permutation
unitaries and possibly a relabelling of the A and B sides,
we may write the Schmidt-rank-two bipartite permutation
unitary as $U = P_1 \otimes I_B + P_2 \otimes V_B$, where $P_1, P_2$
are orthogonal projectors that add up to $I_A$, and $V_B$
is a permutation unitary satisfying $V_B |1\rangle_B = |t\rangle_B$,
where $t \ge 2$ is an integer, and $\{|j\rangle_B\}$ is the computational
basis of $\mathcal{H}_B$. Suppose $|1\rangle_A$ and $|s\rangle_A$ are computational
basis states of $\mathcal{H}_A$ in the support of $P_1$ and $P_2$,
respectively, where $s \ge 2$ is an integer. Then for the
input product state $\frac{1}{\sqrt{2}}(|1\rangle_A + |s\rangle_A) \otimes |1\rangle_B$, the output
is $\frac{1}{\sqrt{2}}(|1\rangle_A \otimes |1\rangle_B + |s\rangle_A \otimes |t\rangle_B)$, which contains 1 ebit of
entanglement. On the other hand, we have commented
previously that the entangling power of any bipartite unitary
of Schmidt rank r is at most $\log_2 r$ ebits. This shows
the entangling power of any Schmidt-rank-two bipartite
permutation unitary is exactly 1 ebit.

From the basic controlled-unitary protocol and
Lemma [TbT il (or from i), the entanglement cost of any
Schmidt-rank-two bipartite permutation unitary is not
greater than 1 ebit. Since the entangling power of 1
ebit provides a lower bound for the entanglement cost,
the entanglement cost of any Schmidt-rank-two bipartite
permutation unitary is exactly 1 ebit. This completes
the proof. □
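The first half of the proof can be illustrated with the simplest instance, the two-qubit CNOT controlled from the A side. The sketch below, with names of our own choosing and 0-based basis labels, applies $U = P_1 \otimes I_B + P_2 \otimes V_B$ to the product input used in the proof and evaluates the entanglement entropy of the output.

```python
import numpy as np


def entanglement_entropy(state, dA, dB):
    """Von Neumann entropy (in ebits) of the reduced state of a pure state on A x B."""
    schmidt = np.linalg.svd(state.reshape(dA, dB), compute_uv=False)
    probs = schmidt ** 2
    probs = probs[probs > 1e-12]          # drop numerically zero Schmidt weights
    return float(-(probs * np.log2(probs)).sum())


dA, dB = 2, 2
P1 = np.diag([1.0, 0.0])
P2 = np.diag([0.0, 1.0])
V = np.array([[0.0, 1.0], [1.0, 0.0]])    # permutation sending the first B state to the second
U = np.kron(P1, np.eye(dB)) + np.kron(P2, V)

# Equal superposition of one basis state from each projector on A, times a basis state on B.
psi_in = np.kron(np.array([1.0, 1.0]) / np.sqrt(2), np.array([1.0, 0.0]))
psi_out = U @ psi_in

entropy_in = entanglement_entropy(psi_in, dA, dB)
entropy_out = entanglement_entropy(psi_out, dA, dB)
print(entropy_in, entropy_out)
```

The input is a product state and the output is maximally entangled on a two-dimensional subspace, matching the 1 ebit in the lemma.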

It should be noted that the “entangling power” in
Lemma 26 can be understood as $K_E$ or $K_{\Delta E}$ (also defined
in [14]), or the amortized $K_E$ or $K_{\Delta E}$ over many
copies of the unitary, since all four quantities are lower
bounds for the entanglement cost which is 1 ebit in the
current case.

As a side note, we consider the entangling power of
complex bipartite permutation unitaries of Schmidt rank
two. Their entangling power $K_E$ can take any value in
the interval (0,1] (ebit). A simplest class of examples are
locally equivalent to the ones in H:
$U = \sqrt{1-p}\, I \otimes I + i\sqrt{p}\, \sigma_z \otimes \sigma_z$, where $p \in (0,1]$. When the definition
is extended to p = 0, U is a Schmidt-rank-one unitary,
with $K_E(U) = 0$. By the continuity of $K_E$ (see 0),
when p is near zero, $K_E(U)$ is near zero while U is a
Schmidt-rank-two diagonal unitary. When p is near 1/2,
$K_E(U)$ is near 1.

Entangling power of Schmidt-rank-three bipar¬ 
tite permutation unitaries. 

The Schmidt-rank-three bipartite unitary U cannot be 
on a 2 X 2 system 1^. Hence the maximum of dA and 
dB is at least three and it is indeed reachable. An ex¬ 


ample acting on C 


IS in 


(051) . The structure of 


the Schmidt-rank-three bipartite permutation unitary U
has been partially investigated in Lemma 16 (i). The
following result gives a range for the entangling power
of such unitaries, although we do not know whether the
lower bound is optimal. The upper bound of $\log_2 3$ ebits
is likely not optimal for some unitaries, see case (1.1) in
the proof.

Proposition 27 The entangling power of a Schmidt-rank-three
bipartite permutation unitary is at least
$\log_2 9 - 16/9 \approx 1.392$ ebits and at most $\log_2 3 \approx 1.585$
ebits.

The proof of this Proposition is in Appendix H. In
the proof, the only case where the entangling power
may be less than $\log_2 3$ ebits is case (1.1), in which
case the entangling power of U is at least $\log_2 9 - 16/9$
ebits, and such U can be implemented using $\log_2 3$ ebits,
while in general U can be implemented using 2 ebits,
according to Lemma IMi). Hence the gap between
the entangling power and the entanglement cost of a
Schmidt-rank-three bipartite permutation unitary is at
most $\max\{2 - \log_2 3,\ \log_2 3 - (\log_2 9 - 16/9)\} < 0.42$ ebits.
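The numerical values quoted in Proposition 27 and in the gap estimate can be reproduced directly. This is only an arithmetic check, with variable names of our own choosing.

```python
import math

lower = math.log2(9) - 16 / 9      # lower bound on the entangling power
upper = math.log2(3)               # upper bound from Schmidt rank three
gap = max(2 - upper, upper - lower)

print(f"lower bound = {lower:.4f} ebits")
print(f"upper bound = {upper:.4f} ebits")
print(f"maximal gap = {gap:.4f} ebits")
```

The larger of the two candidate gaps is $2 - \log_2 3$, which arises when the entangling power is $\log_2 3$ ebits and the cost bound is 2 ebits.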

Taking clue from the results above, we present the fol¬ 
lowing conjecture: 

Conjecture 28 (1) The entangling power of any bipartite
permutation unitary of Schmidt rank three can only
take one of two values: $\log_2 9 - 16/9$ or $\log_2 3$ ebits.

(2) The entangling power of any bipartite permutation
unitary of Schmidt rank r can only be one of f(r) distinct
values, where f(r) is a finite integer-valued function of r.

Numerical calculations suggest that (1) is likely to 
hold. In the calculations we have assumed the most gen¬ 
eral form of initial product pure state with ancillas a 
and b, whose sizes are assumed to be equal to those of 
the corresponding input system A and B, respectively. 
The sizes of a and b need not be larger since it suffices
to consider the Schmidt decomposition on aA and bB,
respectively.


VI. CONCLUSIONS 

We have improved the upper bound for the entanglement
cost of bipartite unitary operators of Schmidt rank
three under LOCC protocols. Lemma [5] implies a structure
theorem for Schmidt-rank-3 bipartite unitaries, as
stated in Theorem 10. We have presented a protocol
attaining the improved upper bound for the entanglement
cost for such unitaries. We have also studied the structure
and entanglement cost of bipartite permutation unitaries
of Schmidt rank up to three, presented two
protocols for implementing bipartite permutation unitaries
of arbitrary Schmidt rank, and analyzed the
entanglement and classical communication costs of the
protocols. These results are independent of the dimensions
of the spaces that the unitary acts on, and they significantly
improve over the corresponding results in [9]. The
results are applied to classical circuits for implementing
bipartite permutation operations, and the protocols we
found are such that whether requiring the ancillas to be
restored to the fixed initial state makes a difference in
the required number of nonlocal CNOT gates. As for the
complex permutation unitaries, our progress is mostly
restricted to Schmidt rank three (apart from some results
for special cases of general Schmidt rank in Lemma 13):
Any Schmidt-rank-three bipartite complex permutation
unitary that is not equivalent to a diagonal unitary under
local permutation unitaries can be implemented with 3
ebits and LOCC, but it remains open whether there is
a constant upper bound of entanglement cost for implementing
an arbitrary Schmidt-rank-three bipartite diagonal
unitary.

We also have quantified the entangling power of bi¬ 
partite permutation unitaries of Schmidt rank two and 
three, and in the Schmidt-rank-three case the results 
suggest that there might be a gap between the entan¬ 
glement cost and the entangling power. The examples of 
Schmidt-rank three bipartite permutation unitaries ap¬ 
pearing in our proofs may be in some sense the simplest 
examples of a gap between the entanglement cost and 
the entangling power, if such gap exists at all: Although 
there are Schmidt-rank-two unitaries that may have such 
gap, those are not permutation unitaries and thus may 
be harder to study. Also, there is some correspondence 
between the permutation unitaries and the classical re¬ 
versible circuits. So if the gap exists, there might be some 
operational implications even classically. 

Looking at this gap problem from the limit of large
Schmidt rank, an apparent open problem is whether the
results of Theorems 22 and 24 can be improved. It is
known that any total boolean function of rank r can
be computed by a deterministic classical communication
protocol with $O(\sqrt{r} \cdot \log r)$ bits of communication. The
problem of implementing bipartite permutations might
be a harder problem than computing a boolean function
on bipartite inputs, but it would be interesting to find
out more about the relation between the two problems.

Appendix A: The proof of Lemma 9

Proof. Firstly, note that $T_2$ must have at least two distinct
eigenvalues, since otherwise U is of Schmidt rank
2, violating the assumption that it is of Schmidt rank 3.
Another observation is that the ratio
$c_{j2}/\sqrt{|c_{j1}|^2 + |c_{j2}|^2}$ (and hence
$c_{j2}^*/\sqrt{|c_{j1}|^2 + |c_{j2}|^2}$) takes at least two different
values among different j, since otherwise U is expandable
using the two operators $c_{j1} I_d + c_{j2} T_2$ and $T_3$ on the
second system with any particular j, implying that U is
of Schmidt rank 2. Let

\[
T_3 = D_3 + E_3, \tag{A1}
\]

where $D_3$ is diagonal, and all diagonal elements of $E_3$
are zero. Then $E_3$ is nonzero. Since U is unitary, (16)
implies that

\[
\begin{aligned}
(c_{j1} I_d + c_{j2} T_2 + c_{j3} T_3)(c_{j1} I_d + c_{j2}^* T_2^\dagger + c_{j3}^* T_3^\dagger) &= I_d, \\
(c_{j1} I_d + c_{j2}^* T_2^\dagger + c_{j3}^* T_3^\dagger)(c_{j1} I_d + c_{j2} T_2 + c_{j3} T_3) &= I_d,
\end{aligned}
\tag{A2}
\]

for all $j \in \{1, \ldots, K\}$. Given that $T_3 T_3^\dagger = T_3^\dagger T_3 = I_d$, we
subtract terms with $T_3 T_3^\dagger$ or $T_3^\dagger T_3$ from both sides of each
equation in (A2). Since any $c_{j1}$ is real, the off-diagonal
part of the resulting equations gives that

\[
\begin{aligned}
c_{j1} c_{j3}^* E_3^\dagger + c_{j1} c_{j3} E_3 + c_{j2} c_{j3}^* T_2 E_3^\dagger + c_{j2}^* c_{j3} E_3 T_2^\dagger &= 0, \\
c_{j1} c_{j3}^* E_3^\dagger + c_{j1} c_{j3} E_3 + c_{j2} c_{j3}^* E_3^\dagger T_2 + c_{j2}^* c_{j3} T_2^\dagger E_3 &= 0
\end{aligned}
\tag{A3}
\]

for all $j \in \{1, \ldots, K\}$. Since $c_{j3}$ are nonzero, we may
divide both sides of the first equation in (A3) by $c_{j3}$, and
obtain two independent equations of variables $E_3$ and
$E_3 T_2^\dagger$ by letting $c_{j2}^*/\sqrt{|c_{j1}|^2 + |c_{j2}|^2}$ take two different
values (the other two terms containing $E_3^\dagger$ and $T_2 E_3^\dagger$ are
viewed as “constants”). Hence $E_3$ and $E_3 T_2^\dagger$ are in the
space $H := \mathrm{span}\{E_3^\dagger, T_2 E_3^\dagger\}$. If $E_3^\dagger \propto T_2 E_3^\dagger$, then $T_2$ is
proportional to the identity matrix on the rows in which
$E_3^\dagger$ is nonzero. The remaining diagonal elements of $T_2$
are in the rows in which $E_3^\dagger$ is zero. By (A1) and the
unitarity of $T_3$, the columns of $E_3^\dagger$ that contain these
diagonal entries (at the same positions in both $T_2$ and $T_3$)
are also zero. Hence $T_2$ and $T_3$ are simultaneously
block-diagonal under a block structure where the first block of
$T_2$ is proportional to the identity matrix. It violates the
assumption that $T_2$ and $T_3$ are not simultaneously
diagonalizable under a unitary similarity transform. Therefore
H has dimension two. We discuss two cases.

Case (a). Here $E_3^\dagger$ and $E_3$ are not proportional, so
they form a basis of H. We have $T_2 E_3^\dagger = g E_3 + h E_3^\dagger$ with
complex numbers g, h. Since $E_3^\dagger$ and $T_2 E_3^\dagger$ also form a
basis of H, we have $g \ne 0$. Then

\[
T_2' E_3^\dagger = E_3 \tag{A4}
\]

with a diagonal matrix $T_2' := (T_2 - h I_d)/g$. Denote $t_j$
as the j-th diagonal element of $T_2'$. It follows from (A1)
and the unitarity of $T_3$ that the row vector and column
vector of $E_3$ containing a diagonal entry of the same position
have equal norm. Let $e_{jk}$ be the (j, k) element
of $E_3$. Let $S := \{j : \exists k \text{ s.t. } e_{jk} \ne 0\}$. Then it follows
from (A4) that $t_j$ for those $j \in S$ all have modulus one.
It follows from (A4) that $t_j e_{kj}^* = e_{jk}$ and $t_k e_{jk}^* = e_{kj}$,
$\forall j, k \in \{1, \ldots, d\}$. So if $e_{jk} \ne 0$, then $j \in S$ and $t_j = t_k$.
Then $t_j \ne t_k$ implies $e_{jk} = 0$. The last result, combined
with the definition of $T_2'$ and (A1), implies that $T_2$ and $T_3$
are simultaneously block-diagonal, where the blocks are
such that each diagonal block of $T_2$ is a scalar matrix.
Hence $T_2$ and $T_3$ are simultaneously diagonalizable under
a unitary similarity transform. It is a contradiction
with the assumption in the lemma. So case (a) has been
excluded.

Case (b). Here $E_3^\dagger$ and $E_3$ are proportional. By adjusting
the phase for $E_3$, while multiplying all $c_{j3}$ by a
corresponding phase factor to keep U unchanged, we have
$E_3^\dagger = E_3$. Applying this equation to the two equations
in (A3), we have

\[
\begin{aligned}
-(c_{j1} c_{j3}^* + c_{j1} c_{j3}) E_3
&= c_{j2} c_{j3}^* T_2 E_3 + c_{j2}^* c_{j3} E_3 T_2^\dagger \\
&= c_{j2} c_{j3}^* E_3 T_2 + c_{j2}^* c_{j3} T_2^\dagger E_3
\end{aligned}
\tag{A5}
\]

for all $j \in \{1, \ldots, K\}$. Left-multiplying the last line
(which is equal to the first line) by $T_2$ and right-multiplying
it by $T_2^\dagger$, we obtain the second line, which
is also equal to the first line, thus we have
$(c_{j1} c_{j3}^* + c_{j1} c_{j3})(E_3 - T_2 E_3 T_2^\dagger) = 0$. Since the unitaries $T_2$ and
$T_3$ are not simultaneously diagonalizable, we have

\[
c_{j1} c_{j3}^* + c_{j1} c_{j3} = 0, \quad \forall j \in \{1, \ldots, K\}. \tag{A6}
\]

Hence the first line of (A5) is zero, thus the second line of
(A5) is zero, and since $c_{j2}$ and $c_{j3}$ are nonzero, we have
$T_2 E_3 \propto E_3 T_2^\dagger$. We may adjust the phase of $T_2$ (while
multiplying all $c_{j2}$ by a corresponding phase factor) so
that

\[
T_2 E_3 = -E_3 T_2^\dagger. \tag{A7}
\]

From (A6) and the fact that all $c_{j1}$ are positive, we obtain
that all $c_{j3}$ are pure imaginary. The last two statements,
combined with the fact that the second line of (A5) is zero,
imply that all adjusted $c_{j2}$ are also pure imaginary.

In the rest of the proof we use three assumptions. First, 
up to a relabeling of the computational basis states of 
'Hb,T 2 = ©Li and T 3 = ©Li where 
and both act on the subspace 'Hbu of and 
is diagonal. Second, and commute, and the 
order of the matrix is the largest possible under this 
requirement and the first assumption. Of course it may 
be possible that such order is zero. If the order is nonzero, 
there is a unitary change of basis in the subspace %Bi , 
such that the transformed and Tg^^ are diagonal, 
while keeping the identity matrix in this subspace [see 
the Id term in (1161) ] unchanged. Third, for any /c > 1, 
and Tg^^ do not commute, and no Tg^^ can be block 
diagonal in a basis in which is diagonal. So any 
with fc > 1 has at least two distinct eigenvalues. It can 
be easily verified that the three assumptions as a whole 
is always valid, although it is possible that Ubi is a null 
space for some U. 


In the following derivations the k is always greater than 
1 unless otherwise stated. Using dUD, we have 

tP = + E^3''\ (A8) 

where is diagonal and the diagonals of Ag^^ are zero. 
Using (IA7I) and (IA 8 I) . we have 

r 2 ^'=Lf^ = -Af^(T 2 ^''V. (A9) 


(k) 

This equation and the assumptions imply that any T 2 ' 
has exactly two distinct eigenvalues with a 

real number ak- There exists a permutation matrix Pk 
such that 


EkT^’^'^Pl 

PuE^3^pI 


e-»=/W©(-e— 

0 \ 

0 j 


(AlO) 

(All) 


where dk and Ck are positive integers. Since Ag = Ag, we 
have Gg^^ = (Ag^L- Since Ag*^ is unitary, (|A8I) implies 
that any two row vectors of Ag ' are orthogonal, and 
any two column vectors of Ag^^ are also orthogonal. Our 

(fc) 

assumptions and the unitarity of Tg ^ imply that there 

(fc) 

is no zero row or column in Ag ^. The last two sentences 
imply Ck = dk > 1. Then the unitary 


4"^ := Id, © i-Id,) e span{/('“'),(A12) 


satisfies = (A^^L- If dk > 1, suppose is 
nonzero. Let the V and D in Lemma[^correspond to Ag^^ 
and Ag^^ respectively. From the form of in (lAlOl) . 
and noting the form of the unitary similarity transform 

in Lemma [21 it can be found that Lemma |2] contradicts 

(fc) 

with the assumption that “no Ag ' can be block diagonal 
in a basis in which is diagonal.” Hence = 0. 
Then (|A8I) and (lAllI) imply that 


rp{k) 



0 



(A13) 


(fc) 

is a unitary matrix. Then Ag ' is a unitary of order dk- 
Let the D and U in Lemma |2] correspond to Cjil 2 d, + 
Cj 2 T^'’ and Ag^^ respectively, for some j G {!,..., K}, 
where K is from mi). From mi), there is a nontrivial 
linear combination of these two matrices that is a unitary, 
so it corresponds to V in Lemma |21 By noting the form 
of A 2 ^^ in (lAlOl) . and the form of the unitary similarity 
transform in Lemma |21 and the fact that a basis in which 
A 2 ^^ is diagonal is also a basis in which Cjil 2 dk is 

diagonal, and vice versa, it can be found that Lemma |2l 

contradicts with the assumption that “no Ag ' can be 

(fc) 

block diagonal in a basis in which A 2 is diagonal.” The 
argument above excludes the possibility of $d_k > 1$. We
have $d_k = 1$.

The above argument implies that in (jAlOl) and 
in (IA13I) are both 2x2 unitary matrices, and 
detTg^^ = — 1. From (lAllI) . by doing a conjugation by 
a suitable diagonal 2x2 unitary: —>• QkT^^^Q^, we 

may assume that the two non-diagonal entries of Tg are 
equal and positive. The conjugation by the diagonal uni¬ 
tary 1 ^( 1 ) © {®k= 2 Q>^) change the Id and T 2 in 

(HU) since the latter are both diagonal. This completes 
the proof. □ 

Appendix B: The proof of Theorem 10

Proof. (i) The assertion follows from the following
argument, which uses Lemma 2.

The condition that $U$ is a Schmidt-rank-3 bipartite
unitary controlled from the $A$ side implies $d_A \ge 3$ and
$d_B \ge 2$. We consider the following decomposition of
a general Schmidt-rank-three unitary $U$ controlled from
the $A$ side:

$$U = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes T_j, \qquad \text{(B1)}$$

where the unitaries $T_1$, $T_2$ and $T_3$ are linearly independent,
and the other $T_j \in \mathrm{span}\{T_1, T_2, T_3\}$ are unitary. Using
a local unitary on $\mathcal{H}_B$, we assume $T_1 = I_B$. We define
the set $S_1 := \{T_j : T_j \in \mathrm{span}\{T_1, T_2\}\}$, thus $T_1$ and $T_2$
are in $S_1$. We refer to $S_2$ as the set of $T_j$ (including $T_3$)
that are in $\mathrm{span}\{T_1, T_3\}$ but not in $S_1$. We also refer to
$S_3$ as the set of $T_j$ that are not in $S_1 \cup S_2$. Every $T_j$ in $S_3$
is of the form $T_j = \sum_{k=1}^{3} c_{jk} T_k$ with nonzero $c_{j2}$ and
$c_{j3}$. The set $\{T_j\}$ is the union of the disjoint sets $S_1$,
$S_2$ and $S_3$. Let the part of the unitary $U$ corresponding to
the set $S_k$ be denoted by $W_k$, $k = 1,2,3$. Using these
notions and (B1), we have that up to a relabelling of the
computational-basis states on $\mathcal{H}_A$,

$$U = W_1 \oplus_A W_2 \oplus_A W_3. \qquad \text{(B2)}$$

Evidently each of $W_1$ and $W_2$ has Schmidt rank at most
two, and $W_3$ has Schmidt rank at most three. Consider
the following two cases.

Case (a): $W_3$ has Schmidt rank not greater than two.
In this case $U$ is of the first standard form in assertion
(i), according to (B2).

Case (b): $W_3$ has Schmidt rank exactly three. We may
apply suitable local unitaries on $\mathcal{H}_B$ before and after $U$
so that $T_1 = I_B$ and $T_2$ is diagonal. In the case that
$T_2$ and $T_3$ are not simultaneously diagonal, Lemma 2
can be applied to the product of $W_3$ with a local diagonal
unitary acting on the subspace of $\mathcal{H}_A$ that $W_3$ resides
in, chosen so that the product satisfies the assumption in
Lemma 2 that all $c_{j1}$ are real. The product is thus of the
second standard form in assertion (i), and then so is $W_3$.
The case that $T_2$ and $T_3$ are simultaneously diagonalizable
is excluded in the assumptions of Lemma 2, but this case is
possible, and $W_3$ is locally equivalent to a diagonal unitary
in this case, so the second standard form in assertion (i) still
holds for $W_3$. Then since $T_1, T_2, T_3$ span the $B$ space of
$U$ as well as the $B$ space of $W_3$, the unitary $U$ also is of
the second standard form.

(ii) Since $U$ is controlled from the $A$ side, it can be
implemented using the basic controlled-unitary protocol
with $\log_2 d_A$ ebits and LOCC. This gives the $d_A$ term
inside the min{} symbol in Eq. (18).

The two-way teleportation protocol, with the $B$ system
being teleported, gives the $d_B^2$ term inside the min{}
symbol in Eq. (18).

If $U$ is of the first standard form in assertion (i), then $U$
is a two-level controlled unitary, where the higher level
controls which of the (up to) three unitaries $W_1, \ldots, W_3$
is to be implemented in the lower level. Each of the
three unitaries in the lower level is a controlled unitary
of Schmidt rank two, thus there is one side in which it
is controlled with two terms Q. Thus $U$ can be implemented
under Protocol 7 with at most $\log_2 3 + 1 = \log_2 6$
ebits and at most $2\log_2 3 + 4 = 2\log_2 12$ c-bits. Since
$d_B \ge 2$, the $\log_2 6$ ebits is not greater than the
entanglement cost discussed in the next paragraph, and the
relation of the entanglement costs in the current paragraph
and the next paragraph is that the maximum is
to be taken between these two, thus the $\log_2 6$ term does
not appear in Eq. (18).

Now consider the second standard form in assertion
(i). We may use Protocol |S| with the choice of group
being the dihedral group $D_{2n}$ with odd $n$. The group is
of order $2n$ in the convention used here (note that the
same group is sometimes denoted as $D_n$ in the literature).
From the representation theory of dihedral groups [27],
such a group $D_{2n}$ has $(n-1)/2$ irreducible two-dimensional
representations and two one-dimensional representations.
There are $\lfloor d_B/2 \rfloor$ $2 \times 2$ blocks and possibly a $1 \times 1$ block
on the $B$ side of the expansion of the bipartite unitary,
by viewing (as many as possible) pairs of $1 \times 1$ blocks
as $2 \times 2$ blocks. Thus we have $n = 2\lfloor d_B/2 \rfloor + 1$, and
the order of the group is $2n = 4\lfloor d_B/2 \rfloor + 2$. So the
group-type protocol needs $\log_2(4\lfloor d_B/2 \rfloor + 2)$ ebits. The
asserted entanglement-cost upper bound (18) is obtained
by combining the results of the cases above.
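
As a numerical illustration of the three terms named in this proof, the following minimal Python sketch evaluates the resulting bound for a few dimensions. It assumes that the bound has the shape $\log_2 \min\{d_A, d_B^2, 4\lfloor d_B/2 \rfloor + 2\}$, which is read off from the paragraphs above rather than copied from the statement of the theorem; the function names are hypothetical.

```python
import math


def dihedral_group_order(d_B: int) -> int:
    """Order 2n of the dihedral group with odd n = 2*floor(d_B/2) + 1."""
    n = 2 * (d_B // 2) + 1
    return 2 * n


def ebit_upper_bound(d_A: int, d_B: int) -> float:
    """log2 of the smallest of the three terms named in the proof of (ii).

    Assumption: the bound is log2(min{d_A, d_B^2, 4*floor(d_B/2) + 2}).
    """
    if d_A < 3 or d_B < 2:
        raise ValueError("a Schmidt-rank-3 unitary controlled from A needs d_A >= 3 and d_B >= 2")
    return math.log2(min(d_A, d_B ** 2, dihedral_group_order(d_B)))


# Print the bound for a few sample dimensions.
for d_A, d_B in [(3, 5), (10, 2), (100, 7)]:
    print(d_A, d_B, round(ebit_upper_bound(d_A, d_B), 4))
```

For small $d_A$ the basic controlled-unitary term is the smallest, while for large $d_A$ and $d_B > 2$ the group-type term takes over.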

In all the cases mentioned above except the first stan¬ 
dard form in assertion (i), the number of bits of classical 
communication is twice the amount of ebits contained in 
the resource entangled state, thus the claim of classical 
communication cost in assertion (ii) holds. □ 

Appendix C: The proof of Lemma 13

Proof. (i) Since $U$ is a complex permutation matrix,
any two nonzero entries in $U$ are in different rows of $U$.
So the first assertion holds. The number of the nonzero
blocks cannot exceed the Schmidt rank of $U$, which is $r$.
On the other hand the number cannot be zero because
$U$ is unitary. So the second assertion holds.

(ii) Up to local permutation matrices, we may assume
that the first $r$ blocks in the big row are nonzero, and
the remaining blocks in the big row are zero. Since $U$ is
a complex permutation matrix and each block is of size
$d_B \times d_B$, there are exactly $d_B$ nonzero entries in distinct
rows of the big row. If the $r$ nonzero blocks in the given
big row contain a common zero column vector, then any
linear combination of them contains a zero column vector
in the same position. And since $U$ is of Schmidt rank $r$,
any block of $U$ is zero in that particular column. This
contradicts the fact that $U$ is unitary. So these
$r$ blocks do not contain any common zero column vector.
Since there are exactly $d_B$ nonzero entries in the $r$
blocks, the nonzero entries in the $r$ blocks are in different
columns. So the assertion follows. Similarly, the assertion
holds when every “row” is replaced with “column”.

(iii) The first two sentences in the claim follow from the 
fact that any block in U is the linear combination of the 
r nonzero blocks described in (ii). The last sentence in 
the claim is reached by using the basic controlled-unitary 
protocol. 

(iv) The argument is exactly similar to the proof of 
(ii)(iii), so we abbreviate it here. 

(v) Up to local permutation unitaries, we may assume
that the first big row of $U$ contains exactly $r-1$ nonzero
blocks, and the nonzero blocks in it are the first $r-1$
blocks, with the first one being equal to $I_s \oplus 0_{d_B-s}$, where
$1 \le s \le d_B-r+2$. In the following we prove that up to local
permutations all the $r-1$ nonzero blocks in the first
big row can be written as orthogonal projectors. Suppose
this were not true, then there would be at least one
common zero column in these $r-1$ nonzero blocks, and
the $r$-th linearly independent block in $U$ must contain a
nonzero element in this column. The linear combination
of the $r$-th block and the first $r-1$ blocks (with nonzero
coefficient for the $r$-th block) can appear at most once in
each big row except the first big row, but must appear in
each big column. Thus the count of such linear combinations
is both not more than $d_A-1$ and exactly equal to
$d_A$, and this is a contradiction. Hence, up to local
permutations all the $r-1$ nonzero blocks in the first big row
can be written as orthogonal projectors. So the assertion
holds.

(vi) The conditions imply that = ^b, 

= dB, and the sum of orders of Q and Qj 
(Vj < n) is dA- These facts are used in the following 
proof. 

Since $U$ has Schmidt rank $r$, there is an $r$-th linearly
independent block in $U$. This is a complex partial permutation
matrix named $R$. We regard it as a partitioned
matrix

$$R := \sum_{j,k=1}^{r-1} |j\rangle\langle k| \otimes R_{jk}, \qquad \text{(C1)}$$

where the subblock $R_{jk}$ is of size $s_j \times s_k$. In particular,
the diagonal subblock $R_{jj}$ is in the same position and
of the same size as that of $P_j$ in any diagonal block of
$U$. Up to local permutation matrices on $U$, we may use
the hypothesis that $n$ is the integer such that for any
$j \in [1,n]$, there is a nonzero $R_{j,k_1}$ or $R_{k_2,j}$; and at the
same time any $R_{j,k_1}$ and $R_{k_2,j}$ are both zero when $j > n$,
$j \ne k_1, k_2$, and $k_1, k_2 \in [1, r-1]$. In other words, $R$ is
the direct sum of the upper left $(\sum_{j=1}^{n} s_j) \times (\sum_{j=1}^{n} s_j)$
submatrix $R'$ and $r-n-1$ subblocks $R_{jj}$ of size $s_j \times
s_j$, $j = n+1, \cdots, r-1$, where the integer $n \in \{0\} \cup
\{2, \ldots, r-1\}$, since $n = 1$ implies that there is a nonzero
off-diagonal block $R_{1k}$ where $k \ge 2$, meaning that $n \ge 2$,
thus the case $n = 1$ does not exist.

Since $U$ is a complex permutation matrix, any block
of $U$ is a complex partial permutation matrix, which is
a linear combination of $P_1, \cdots, P_{r-1}$ and $R$. These
facts, (C1) and the hypothesis imply that $P_1, \cdots, P_n$
do not appear in any linear combination containing $R$
with nonzero coefficient. So any block in $U$ is either a linear
combination of $P_1, \cdots, P_{r-1}$, or the direct sum of
$R'$ multiplied by a phase and $r-n-1$ subblocks of
size $s_j \times s_j$, $j = n+1, \cdots, r-1$. The hypothesis implies
that each subblock is a linear combination of $R_{jj}$
and $P_j$. The submatrix of $U$ on the bipartite Hilbert
space $\mathcal{H}_A \otimes \mathrm{span}\{|s_1 + \cdots + s_n + 1\rangle, \cdots, |d_B\rangle\}$ forms the
second bracket in (21). The remaining part of the unitary
$U$, named $U'$, acts on the bipartite Hilbert subspace
$\mathcal{H}_A \otimes \mathrm{span}\{|1\rangle, \cdots, |s_1 + \cdots + s_n\rangle\}$. The above argument
implies that each block of $U'$ is a linear combination
of $P_1, \cdots, P_n$ and $R'$. In particular, the block has to be
proportional to $R'$ when $R'$ appears in the linear combination.
The hypothesis also implies that the big row
or big column of $U'$ containing $R'$ does not contain any
other nonzero block. So $R'$ is a complex permutation
matrix. By letting $R' = P$, we can decompose $U'$ into
the expression in the first bracket of (21). For the $U_j$ in
the last big bracket in (21), it has Schmidt rank at most
two, since $R$ and the term with the specific $P_j$ (where
$j > n$) each contributes at most 1 to the Schmidt rank.
So the first paragraph in the claim holds.

The last paragraph in the claim is from a multiple-level 
recursive control protocol generalized from Protocol [B] 
In the case $n \in [2, r-2]$, the protocol has three levels.
The first level is choosing between the two terms in
(21). If the choice is the first term, the second level then
chooses between the two terms in the first big bracket
in (21). Otherwise, the second level chooses between the
terms in the last big bracket in (21), and the third level
implements a Schmidt-rank-two unitary using the basic
controlled-unitary protocol. In the case $n = 0$, the protocol
similarly has three levels but the first branch in the
choices does not have the second or third level. In the
case $n = r-1$, the protocol has only two levels since the
last term in (21) does not exist. In all cases, the lowest
level of the protocol is the basic controlled-unitary
protocol.

(vii) Let $U$ be a real permutation matrix and let it be
of the form of the $n = 0$ case in (21). We can instead expand
$U$ using orthogonal projectors $P_1, P_2, \ldots, P_{r-1}$,
and the matrix $R$ on the $B$ side, where $R$ is defined in
the proof of (v) and is block-diagonal in the sense that
$R_{jk} = 0$ for $j \ne k$ in (C1), since $n = 0$. If $R$ is a diagonal
matrix, then $R$ cannot be the identity matrix since then
it would be in $\mathrm{span}\{P_1, \ldots, P_{r-1}\}$, violating that $U$ is of
Schmidt rank $r$. But $R$ can be of less than full rank under
the assumption that $R$ is diagonal, and in such case $U$ is
a controlled permutation matrix controlled from the $B$ side
with at most $2(r-1)$ terms, which is the second form
for $U$ in the assertion. Now suppose $R$ is not diagonal.
Any block in $U$ cannot be a linear combination of $R$ and
$P_1, \ldots, P_{r-1}$ with nonzero coefficient for $R$, since then it
would have two nonzero elements in some row. Thus any
block of $U$ must be either $R$ or a linear combination of
$P_1, \ldots, P_{r-1}$. Thus $U$ is the $A$-direct sum of a unitary
whose $B$ space is spanned by $R$ only, and another unitary
whose $B$ space is spanned by $P_1, \ldots, P_{r-1}$, and the
latter is an $(r-1)$-term controlled-permutation unitary
controlled from the $B$ side. This is exactly the form for
the case $n = r-1$ in (21). Thus the assertion holds,
and the statement about entanglement cost follows from
Protocol m 

This completes the proof. □ 

Appendix D: The proof of Lemma 14

Proof. (i) The claim holds by definition.

(ii) The equality obviously holds when $r = 1$. In the
following we assume $r \ge 2$. Denote the unitary as

$$U = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes V_j. \qquad \text{(D1)}$$

A class of examples $U$ with $2^{r-1}$ distinct diagonal blocks
satisfy $d_A = 2^{r-1}$, $d_B = 2r-2$, $V_1 = I_{2r-2}$, and for
$k = 2, \ldots, r$, $V_k := I_{2r-2} + |2k-3\rangle\langle 2k-2| + |2k-2\rangle\langle 2k-3|
- |2k-3\rangle\langle 2k-3| - |2k-2\rangle\langle 2k-2|$. The $2^{r-1}$ diagonal
blocks $V_j$ are of the form $V_1 + \sum_{k=2}^{r} y_k (V_k - V_1)$, where
$y_k$ is 0 or 1 for each $k \in [2,r]$. Hence

$$m(r) \ge 2^{r-1}. \qquad \text{(D2)}$$
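
The class of examples above can be checked by brute force for small $r$. The following Python sketch is an illustration only: it builds $V_1$ as the identity and each $V_k$ as the transposition of the basis states $2k-3$ and $2k-2$ (shifted to 0-based indices), forms all $2^{r-1}$ blocks, and verifies that they are distinct permutation matrices spanning an $r$-dimensional space. The helper names are hypothetical, and the rank is computed with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def swap_matrix(n, a, b):
    """Permutation matrix exchanging basis states a and b (0-indexed)."""
    m = identity(n)
    m[a][a] = m[b][b] = 0
    m[a][b] = m[b][a] = 1
    return m


def flatten(m):
    return [x for row in m for x in row]


def rank(vectors):
    """Rank of a list of integer vectors, by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def is_permutation_matrix(m):
    n = len(m)
    return (all(x in (0, 1) for row in m for x in row)
            and all(sum(row) == 1 for row in m)
            and all(sum(m[i][j] for i in range(n)) == 1 for j in range(n)))


def example_blocks(r):
    """All blocks V_1 + sum_k y_k (V_k - V_1) with y_k in {0, 1}, for d_B = 2r - 2."""
    d_B = 2 * r - 2
    V = [identity(d_B)] + [swap_matrix(d_B, 2 * k - 4, 2 * k - 3) for k in range(2, r + 1)]
    blocks = []
    for y in product((0, 1), repeat=r - 1):
        block = [row[:] for row in V[0]]
        for y_k, V_k in zip(y, V[1:]):
            for i in range(d_B):
                for j in range(d_B):
                    block[i][j] += y_k * (V_k[i][j] - V[0][i][j])
        blocks.append(block)
    return blocks


for r in range(2, 6):
    blocks = example_blocks(r)
    distinct = {tuple(flatten(b)) for b in blocks}
    print(r, len(distinct), rank([flatten(b) for b in blocks]))
```

Each printed line lists $r$, the number of distinct blocks, and the dimension of their span, so the second column can be compared with $2^{r-1}$ and the third with $r$.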

Now we proceed with the main proof. Up to local
permutations on $\mathcal{H}_A$ we may assume the first $r$ diagonal
blocks of $U$ in (D1) are linearly independent. We still
denote them as $V_1, V_2, \ldots, V_r$. Since each $V_h$ is a linear
combination of them, we have $V_h = \sum_{k=1}^{r} x_k^{(h)} V_k$. Since
all $V_j$ are permutation matrices, the sum of elements in
each row of any $V_j$ is 1. Thus we have $\sum_{k=1}^{r} x_k^{(h)} = 1$.
These two equations imply

$$V_h - V_1 = \sum_{k=2}^{r} x_k^{(h)} (V_k - V_1). \qquad \text{(D3)}$$

For each $k = 2, \cdots, r$ we regard $V_k - V_1$ as a $d_B^2$-dimensional
vector. Let the $d_B^2 \times (r-1)$ matrix $M$
consist of the column vectors $V_2 - V_1, \cdots, V_r - V_1$. Since
$V_1, \cdots, V_r$ are linearly independent, $M$ is of full rank
$r-1$. Since the entry sum in each row of the matrix
$V_k - V_1$ is zero, we can perform fixed row operations
on $M$ to make zero the $d_B$ rows corresponding to the
nonzero entries of $V_1$. The resulting matrix $M'$ has the
same rank as $M$, since row operations preserve the matrix
rank. There is a matrix $M''$ which is an $(r-1) \times (r-1)$
submatrix of $M'$, obtained by deleting the $d_B$ zero rows
and some other rows in $M'$, which has the same rank as
$M$, namely $r-1$. Then (D3) is equivalent to the fact that
the vector $M'' \cdot (x_2^{(h)}, \cdots, x_r^{(h)})^T$ has entries one or zero,
since all entries of $V_h$ are 0 or 1, and the nonzero entries
of $V_1$ are excluded by the deletion mentioned above. So
there are at most $2^{r-1}$ sets of solutions of $x_2^{(h)}, \cdots, x_r^{(h)}$.
It implies $m(r) \le 2^{r-1}$. Combining it with (D2) we have
$m(r) = 2^{r-1}$.

(iii) The claim follows from (ii) and the basic 
controlled-unitary protocol. 

(iv) A set of $B$-side Schmidt operators of $U$ can be
chosen to be a set of linearly independent $d_B \times d_B$ blocks
in the matrix $U$, hence they are partial permutation matrices
(but in general they cannot be an arbitrary set of
partial permutation matrices, since they jointly have to
have support on every input computational-basis state).
Then the assertion follows by definition.

(v) If $m'(r) > 2^{r-1}$, by assertions (i), (ii) and that
$r > 1$, there must be at least $r$ linearly independent
ones among these $m'(r)$ distinct permutation matrices.
Then assertion (ii) implies $m'(r) = 2^{r-1}$, a contradiction.
Hence $m'(r) \le 2^{r-1}$. But by definition $m'(r) \ge m(r)$,
hence $m'(r) = 2^{r-1}$.

(vi) The following argument is almost the same as the 
last paragraph of the proof of Lemma 21 in Q. For 
completeness we include the rewritten argument below. 

Suppose $\{F_i\}_{i=1}^{r}$ is a set of $r$ linearly independent
matrices among the blocks of $U$. All nonzero partial permutation
matrices in the $B$-space of $U$ are linear combinations
of $\{F_i\}_{i=1}^{r}$. This last property still holds if we replace
$\{F_i\}_{i=1}^{r}$ with $\{G_i\}_{i=1}^{r}$, defined as follows: Each $G_i$ is
a linear combination of $\{F_i\}_{i=1}^{r}$ and satisfies $G_i(t) = \delta_{it}$,
$i,t \in \{1,2,\ldots,r\}$, where $G_i(t)$ is the $t$-th matrix element
of $G_i$ according to some fixed ordering of the matrix elements,
and $\delta_{it}$ is the Kronecker delta. Such an ordering
of the matrix elements must exist but the exact choice
depends on the set $\{F_i\}_{i=1}^{r}$. We do not have extra
restrictions on the $G_i(t)$ with $t > r$. Any nonzero partial
permutation matrix in the $B$-space of $U$ is a linear
combination of $G_i$ ($i = 1,2,\ldots,r$), and the coefficient for
each $G_i$ is either 0 or 1, since the resulting matrix is a
partial permutation matrix, which implies that its first $r$
elements (in the ordering above) must be either 0 or 1.
Since we only consider the nonzero matrices, the coefficients
cannot all be zero, thus there are at most $2^r - 1$
nonzero partial permutation matrices in the $B$-space of
$U$. This proves $n(r) \le 2^r - 1$.
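
The counting argument can be illustrated on two tiny cases. The Python sketch below is a hedged illustration with hypothetical helper names: it enumerates all 0/1 coefficient vectors for a given key-bit basis and counts how many combinations are nonzero partial permutation matrices. For $r$ orthogonal diagonal projectors (a controlled unitary from the $B$ side) every nonzero combination qualifies, whereas for the basis $\{I, X\}$ on a qubit the two key bits lie in the same row and the all-ones combination fails.

```python
from itertools import product


def combine(coeffs, mats):
    """Entrywise linear combination sum_i coeffs[i] * mats[i]."""
    n = len(mats[0])
    return [[sum(c * m[i][j] for c, m in zip(coeffs, mats)) for j in range(n)]
            for i in range(n)]


def is_partial_permutation(m):
    """True if all entries are 0 or 1 with at most one 1 per row and per column."""
    n = len(m)
    entries_ok = all(x in (0, 1) for row in m for x in row)
    rows_ok = all(sum(row) <= 1 for row in m)
    cols_ok = all(sum(m[i][j] for i in range(n)) <= 1 for j in range(n))
    return entries_ok and rows_ok and cols_ok


def count_nonzero_partial_permutations(mats):
    """Count nonzero partial permutation matrices among 0/1 combinations of `mats`.

    Assumes `mats` is a key-bit basis in the sense of the proof, so that only
    coefficients 0 or 1 need to be considered.
    """
    count = 0
    for coeffs in product((0, 1), repeat=len(mats)):
        if any(coeffs) and is_partial_permutation(combine(coeffs, mats)):
            count += 1
    return count


# r = 3 orthogonal rank-one diagonal projectors on a qutrit.
projectors = [[[1 if (i == j == k) else 0 for j in range(3)] for i in range(3)]
              for k in range(3)]
# r = 2 basis {I, X} on a qubit, whose key bits (0,0) and (0,1) share a row.
identity_and_flip = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]

print(count_nonzero_partial_permutations(projectors))
print(count_nonzero_partial_permutations(identity_and_flip))
```

The first count reaches $2^r - 1$ while the second stays below it, in line with the statement that only controlled unitaries from the $B$ side attain the maximum.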

The value $2^r - 1$ is attained by an $r$-term controlled unitary
controlled from the $B$ side. To prove that no other
type (up to local permutation equivalence) of bipartite
permutation unitaries $U$ can achieve the value $2^r - 1$,
we make use of the essence of the argument in the last
paragraph of the proof of the same Lemma 21, that is, there
are $r$ positions in the $d_B \times d_B$ matrix such that the values
of these elements (each is 0 or 1, and is called a “key
bit” below) determine the values of other entries of the
matrices in the $B$ space of $U$ via fixed linear relations.
Since there are $2^r - 1$ nonzero partial permutation matrices
in the $B$ space of $U$, it must be that every binary
combination of the values of the $r$ key bits except the all-zero
combination appears in a partial permutation matrix
in the $B$ space of $U$. (Note that if the number $2^r - 1$
were a smaller number, in general any binary combination
of the values of the $r$ key bits does appear in some
matrix in the $B$ space of $U$ but such a matrix might not
be a partial permutation matrix.) Thus no two key bits
are located in the same row or column, since otherwise
the matrix corresponding to the two key bits being both
1 cannot be a partial permutation matrix. Suppose one
key bit is at position $(r_1, c_1)$, i.e. row $r_1$ and column $c_1$,
and another key bit is at position $(r_2, c_2)$. By considering
the $(i,j)$ entry of the $d_B \times d_B$ matrix corresponding to
both key bits being set to 1, where $(i,j) \ne (r_1, c_1)$ and
$(i,j) \ne (r_2, c_2)$, we find that such an $(i,j)$ entry cannot be
both 1 in the two matrices corresponding to the two key
bits being set to 1,0 and 0,1, respectively, as the latter
two matrices add up to the former matrix. This shows
that any matrix corresponding to only one key bit set to
1 must be orthogonal to any other such matrix, where
orthogonal means having no common nonzero rows and
no common nonzero columns. And since $U$ is unitary,
for any row and column in the $d_B \times d_B$ matrix there has
to be at least one nonzero element appearing in a partial
permutation matrix with only one key bit set to 1, thus
the bipartite permutation unitary is equivalent to a controlled
unitary from the $B$ side under local permutation
unitaries. This completes the proof. □


Appendix E: The proof of Lemma 16

Proof, (i) We call the first statement the “assertion.” 
In the following we prove the assertion first, then prove 
the statement about entanglement cost at the end. 

We use the same notations as in the proof of LemmalTSl 
First, if there is a big row or column of $U$ containing
three nonzero blocks, then from Lemma 13 (iv), $U$ is
equivalent to a three-term controlled-permutation unitary
controlled from the $B$ side, up to local permutation
unitaries.

Next, if there is exactly one nonzero block in each big
row of $U$, then up to local permutation unitaries, $U$ is
equivalent to a controlled-permutation unitary controlled
from the $A$ side. The number of terms is between the
Schmidt rank $r$ and $2^{r-1}$ by Lemma 14 (ii). So it is
either three or four.

The remaining case is that there is a big row of $U$
containing exactly two nonzero blocks. From Lemma 13
(vi), we have a standard form in (21), which satisfies the
assertion except in the case $n = 0$. In the case $n = 0$, the
assertion follows from Lemma 13 (vii).

Now we prove that the entanglement cost is at most
2 ebits. In the first case in the assertion, the result follows
from the basic controlled-unitary protocol. In the
only remaining case in the assertion, the result follows
from Protocol 7, where the higher level of this two-level
protocol determines which of the product permutation
unitary or the two-term controlled-permutation unitary
is to be implemented in the lower level. The entanglement
cost for the two-level protocol is $\log_2 2 + \log_2 2 = 2$
ebits. For each ebit used in the protocols, two c-bits are
used, hence the classical communication cost is not more
than 4 c-bits. So the assertion holds.

(ii) Suppose $U$ is a Schmidt-rank-three bipartite complex
permutation unitary that is not equivalent to a diagonal
unitary under local permutation unitaries. It follows
from Lemma 13 (i) that some big column or row of
$U$ contains at most three nonzero blocks.
If the number is exactly three or two, then the assertion
follows from Lemma 13 (iii) or (vi), respectively. It remains
to investigate the case when the number is one. We
exchange the $A$ and $B$ systems of $U$ to obtain another
matrix, which is still a Schmidt-rank-three bipartite
complex permutation unitary. Since $U$ is not equivalent
to a diagonal unitary under local permutation unitaries,
the nonzero blocks of $U$ do not have the same
nonzero patterns (the pattern about which of the elements
are nonzero), hence there are two nonzero blocks
of $U$ such that there is a nonzero element located in the
same row within each block but at different column positions.
This means that some big row of the exchanged matrix
contains at least two nonzero blocks. The assertion again follows
from Lemma 13 (iii) and (vi).

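(iii) Let the unitary be

$$U = |1\rangle\langle 1| \otimes I_B + |2\rangle\langle 2| \otimes (xP + yP^\perp) + \sum_{j=3}^{d_A} |j\rangle\langle j| \otimes V_j$$

with two different phases $x, y$
and a projector $P$ onto some states in the computational
basis of $\mathcal{H}_B$, which can be assumed to be the first states
in the basis, i.e. their labels are before the states in the
support of the projector $P^\perp := I_B - P$. The $V_j$ are
diagonal matrices. We have $U = U_1 \oplus_B U_2$, where

$$U_1 = (|1\rangle\langle 1| + x|2\rangle\langle 2|) \otimes P + \sum_{j=3}^{d_A} |j\rangle\langle j| \otimes P V_j P,$$

$$U_2 = (|1\rangle\langle 1| + y|2\rangle\langle 2|) \otimes P^\perp + \sum_{j=3}^{d_A} |j\rangle\langle j| \otimes P^\perp V_j P^\perp. \qquad \text{(E1)}$$

Since $U$ is of Schmidt rank 3, there is a $V_j$ (denoted
$V_3$ without loss of generality) that is not a linear combination
of $I_B$ and $xP + yP^\perp$. Every other $V_j$ is
in $\mathrm{span}\{I_B, xP + yP^\perp, V_3\}$. The matrices $PV_3P$ and
$P^\perp V_3 P^\perp$ are diagonal.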
If $PV_3P$ has three or more distinct nonzero diagonal
elements, then among the matrices $PV_jP$ there cannot
be any linear combination of $P$ and $PV_3P$ with nonzero
coefficients for both terms, because of Lemma [TJiii) and 
the fact that the set $\{PV_jP\} \cup \{P\}$ contains exactly two
linearly independent matrices; the latter is because the
set $\{V_j\} \cup \{I_B, xP + yP^\perp\}$, which spans the $B$ space of
the Schmidt-rank-three unitary $U$, contains exactly three
linearly independent matrices. Thus every $V_j$ is either
proportional to $V_3$, or is in $\mathrm{span}\{I_B, xP + yP^\perp\}$. Thus $U$
can be written as $U = W_1 \oplus_A W_2$, where $W_1$ is a
Schmidt-rank-two unitary with the $B$ space being $\mathrm{span}\{I_B, xP +
yP^\perp\}$, and $W_2$ is a product unitary with the $B$ space
being spanned by $V_3$. Thus $U$ can be implemented using
Protocol 7 with the lower level of this two-level protocol
using at most 1 ebit of entanglement, and the higher level
(choosing between $W_1$ and $W_2$) using 1 ebit. Thus $U$ can
be implemented by 2 ebits and LOCC in this case.

If $P^\perp V_3 P^\perp$ has three or more distinct nonzero diagonal
elements, we similarly have that $U$ can be implemented
with 2 ebits and LOCC.

Now suppose $PV_3P$ and $P^\perp V_3 P^\perp$ each has at most
two distinct nonzero diagonal elements. Apparently any
$PV_jP$ is in $\mathrm{span}\{P, PV_3P\}$, thus $U_1$ (not $U$) is a unitary
of Schmidt rank one or two, and can be written in
a form of being controlled from the $B$ side (up to local
unitaries) with at most two controlling terms. Similarly,
by considering $P^\perp V_3 P^\perp$, we get that $U_2$ is controlled
from the $B$ side (up to local unitaries) with at most two
controlling terms. And since $U = U_1 \oplus_B U_2$, $U$ is
locally equivalent to a controlled unitary with at most 4
controlling terms on the $B$ side, hence $U$ can be implemented
using at most 2 ebits and LOCC under the basic
controlled-unitary protocol.

Hence, in all cases, $U$ can be implemented by 2
ebits and LOCC. □


Appendix F: The proof of Theorem 22


Proof. Denote the unitary as $U$. We first prove the
term $\log_2(B_{r+1}) + r + \log_2 r$ in the assertion. In
the following we consider the cases that $r \ge 4$, and the
method is just to apply Protocol 18 to the unitary $U$.
The cases of $r \le 3$ will be mentioned later.

The dimension of $a$ (and $e$, $e'$) in Protocol 18 is the
effective input dimension of $A$, i.e., the number of different
input types of $A$, or the number of different big columns
of $U$ characterized by the set of nonzero blocks in the big
column regardless of the order of the blocks. The effective
input dimension of $A$ is at most $B_{r+1}$, which follows from
Lemma 19 by noting the following: All the blocks of $U$
are in the linear span of $r$ linearly independent blocks in
$U$, and we may regard the $S$ in Lemma 19 as the set of
all blocks in $U$, and each big column of $U$ corresponds
to a covering subset of $S$ determined by which nonzero
blocks are in the big column.


The dimension of $f'$ in Protocol 18 is the effective output
dimension of $A$ relative to the input computational
basis state of $\mathcal{H}_A$, and it is at most $r$, because there can
be at most $r$ nonzero blocks in a big column of $U$.

The dimension of $h'$ in Protocol 18 is the effective output
dimension of $B$. In Def. 17 (iii) it is shown that the
simplified definition is equivalent to the original definition
for the effective output dimension of $B$, thus there
are at most $2^r$ output types of $B$. It may be worth noting
that the definition of such output types of $B$ above is
independent of the output of $A$, and this is for the final
phase correction in Fig. 1 to be successfully carried
out.

Thus, when $r \ge 4$, the number of ebits needed in the
whole protocol is at most $\log_2(B_{r+1} \cdot r \cdot 2^r) = \log_2 B_{r+1} +
r + \log_2 r < \log_2[0.792r/\log_e(r+1)]^r + r + \log_2 r =
O(r \log r)$. For each ebit in the protocol, 2 c-bits are
needed.

When $r \le 3$, the number of ebits needed are 0, 1, and
2 ebits for $r = 1,2,3$, respectively, where the latter two
results are from Lemma 15 and 16 respectively. Again,
for each ebit in the protocols, 2 c-bits are needed.

The above shows that $U$ can be implemented using at
most $\log_2(B_{r+1}) + r + \log_2 r$ ebits and twice as many
c-bits.

In the following we prove for the term 8r — 8 in the 
assertion. From Lemma and the symmetry of the two 
sides, the number of possible input types in the loose 
sense on each of the A and B sides is not more than 
$2^{r-1}$. Consider Protocol 21 shown in Fig. 2. The $a$
contains the input type of system $A$ in the loose sense,
so its dimension is at most $2^{r-1}$. Hence the teleportation
of $a$ to Bob’s side requires at most $r-1$ ebits and $2r-2$
c-bits. Similarly, the teleportation of $b$ to Alice’s side
requires at most $r-1$ ebits and $2r-2$ c-bits. Teleporting
these systems back requires the same amount of nonlocal
resources. Since $W$ has the same Schmidt rank as $U$,
the entanglement and classical communication cost of the
second part of the protocol is bounded above by the same
numbers as in the first part of the protocol. Hence, $8r-8$
ebits and $16r-16$ c-bits suffice to implement $U$.

Thus the assertion is proved by combining the upper 
bounds for the two protocols above. □ 
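
To see how the two upper bounds compare numerically, the following Python sketch evaluates the right-hand side of the displayed inequality, $r \log_2[0.792r/\log_e(r+1)] + r + \log_2 r$, against $8r-8$. This is an illustration under stated assumptions: it uses the looser analytic estimate rather than the exact value of $\log_2 B_{r+1}$, it reads the logarithm in the denominator as the natural logarithm, and the function names are hypothetical.

```python
import math


def first_protocol_estimate(r: int) -> float:
    """Analytic upper estimate r*log2(0.792 r / ln(r+1)) + r + log2(r), for r >= 4."""
    if r < 4:
        raise ValueError("the estimate is used in the proof only for r >= 4")
    return r * math.log2(0.792 * r / math.log(r + 1)) + r + math.log2(r)


def second_protocol_bound(r: int) -> int:
    """Entanglement cost 8r - 8 of the teleportation-based protocol."""
    return 8 * r - 8


# Compare the two bounds at a few Schmidt ranks.
for r in (4, 10, 100, 1100, 1200):
    a, b = first_protocol_estimate(r), second_protocol_bound(r)
    print(r, round(a, 1), b, "first" if a < b else "second")
```

Even with the looser estimate, the first bound stays below $8r-8$ over a wide range of $r$ and is overtaken only at Schmidt ranks above one thousand, which is why both terms are kept in the assertion.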


Appendix G: The proof of Theorem 24

Proof. (i) For $r = 2$ and $r = 3$, we use the basic
controlled-unitary protocol or the recursive controlled
protocol [Protocol 7(a)], which are used in the proof of
Lemma 15(i) and 16(i), respectively, but with modifications
to use nonlocal CNOT gates instead of entanglement,
similar to those below for the case of general $r$. For
$r \ge 4$, we use the adapted versions of the two protocols
in the proof of Theorem 22. The details are given
in the following paragraphs, but the main idea is to use
local classical reversible gates instead of the local quantum
permutation gates, and replace the entangled state
and teleportation and the directly related LOCC operations
with the classical nonlocal CNOT gate. According
to the definition of ebits in Sec. II, the non-integer
entanglement cost in Theorem 22 means that a maximally
entangled state on a $k \times k$ system is used, where $k$ is not a
power of 2. Since we are concerned with the CNOT gate
cost, we extend such an entangled state to be a maximally
entangled state on a $2^n \times 2^n$ system, where $n \in \mathbb{N}$, and
this gives the ceiling function in the assertion. In the
following we consider the two protocols in the proof of
Theorem 22, respectively.

For the first protocol in the proof of Theorem 22, which
is Protocol 18, we may use an integer number of nonlocal
CNOT gates to prepare $e'$ on the $B$ side, where $e'$
is the input to the $W$ gate in Protocol 18, and similarly
the same number of nonlocal CNOT gates is needed later
to erase the $e'$, so two nonlocal CNOT gates are needed
for every ebit in the $ee'$ state in Protocol 18. The teleportation
of qubits from the $B$ side to the $A$ side is
replaced with an integer number of classical DCNOT
(double-CNOT; the quantum version appears in the cited reference)
gates, where each DCNOT gate includes a CNOT gate
controlled from $B$, where the controlled bit on $A$ is an
auxiliary bit initially in the fixed value 0, followed by a
CNOT gate controlled from $A$. In other words, two nonlocal
CNOT gates are used to transfer each bit from $B$
to $A$ while sending the auxiliary bit initialized in 0 from
$A$ to $B$. The original teleportation needs one ebit to teleport
each qubit. Thus each term in the expression for the
number of required nonlocal CNOT gates is at most two
times the ceiling function of the number of ebits used in
the corresponding part of Protocol 18.
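
The classical DCNOT step described above can be traced on bits directly. The following Python sketch is a minimal illustration with hypothetical function names: a CNOT controlled from $B$ copies the bit onto an auxiliary bit on $A$ initialized to 0, and a second CNOT controlled from $A$ then resets the bit on $B$.

```python
def cnot(control, target):
    """Classical CNOT: returns (control, target XOR control)."""
    return control, target ^ control


def dcnot_transfer(bit_on_B):
    """Move one classical bit from B to A using two nonlocal CNOT gates.

    Returns the pair (bit now held by A, bit now held by B).
    """
    aux_on_A = 0  # auxiliary bit on A, initialized to the fixed value 0
    bit_on_B, aux_on_A = cnot(bit_on_B, aux_on_A)  # CNOT controlled from B
    aux_on_A, bit_on_B = cnot(aux_on_A, bit_on_B)  # CNOT controlled from A
    return aux_on_A, bit_on_B


for b in (0, 1):
    print(b, dcnot_transfer(b))
```

After the two gates, $A$ holds the original bit and $B$ holds 0, so the net effect is a swap of the data bit with the initialized auxiliary bit at the cost of two nonlocal CNOT gates.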

For the second protocol in the proof of Theorem 22,
which is Protocol 21, each ebit can be turned into one
nonlocal CNOT gate. For example, the first teleportation
of the $a$ ($b$) system can be implemented by at most
$r-1$ nonlocal CNOT gates to send the information about
the computational basis of the register $a$ ($b$), and the
teleportation back later can be implemented by at most $r-1$
CNOT gates to erase the state on one party, and then
the remaining local copy of $a$ ($b$) can be locally erased
by the inverse circuit of the local circuit used to prepare
it. Thus the number of nonlocal CNOT gates needed is
equal to the number of ebits used in Protocol 21. This
completes the proof of (i).

(ii) The following is the classical version of the first part 
of Protocol]!!] From LemmallOland the symmetry of the 
two sides, the number of possible input types in the loose
sense on each of the $A$ and $B$ sides is not more than $2^{r-1}$.
Consider a classical circuit where Alice sends the input
type of system $A$ to the $B$ side using $r-1$ CNOT gates,
and Bob sends the input type of $B$ to the $A$ side using
$r-1$ CNOT gates. Then each party computes the output
of the local system, while keeping a copy of the inputs
(both the local input and the received information about
input types on the other system), in order to make the
local circuit reversible, but this leaves some local ancillas
with some value dependent on the inputs. Hence $2r-2$
CNOT gates suffice under the condition in the assertion.

(iii) The assertion follows from (i) as well as the fact that
in the circuits in the proof of (i), the ancillas in the end 
do not contain information about the input. This last 
condition about the final state of ancillas is necessary for 
implementing a quantum unitary operation, and is actu¬ 
ally sufficient as long as there are no measurements and 
all gates are unitary; see Theorem 1 of □ 

Appendix H: The proof of Proposition 27

Proof. The upper bound follows from the definitions of
the entangling power and the Schmidt rank of the bipartite
unitary. To prove the lower bound, we consider three
possible forms of $U$, which are studied in detail below. In
all cases except case (I.1), the entangling power is $\log_2 3$
ebits.

Case (I). Suppose $U$ is a controlled permutation unitary
with three terms, and is controlled from the A side.
Up to local permutation matrices, we may assume

$$U = D_1 \otimes I_B + D_2 \otimes (I_m \oplus I_n \oplus V_1 \oplus V_2) + D_3 \otimes (I_m \oplus V_3 \oplus I_q \oplus V_4), \tag{H1}$$

where $D_j D_k = \delta_{jk} D_j$, $\sum_{j=1}^{3} D_j = I_A$, and $V_1$, $V_2$, $V_3$ and
$V_4$ are permutation matrices. $V_1$ and $V_3$ are respectively
of size $q \times q$ and $n \times n$, and $V_2$ and $V_4$ are both of size
$p \times p$ where $p = d_B - m - n - q$. If $V_1$ or $V_3$ contains a
nonzero diagonal entry, then we can move the entry by
local permutation matrices on $\mathcal{H}_B$ so that $I_m$ is replaced
with $I_{m+1}$. So $V_1$ and $V_3$ do not contain any nonzero
diagonal entry. Similarly, we may assume that $V_2$ and $V_4$
do not have a nonzero diagonal entry in the same column
when $p > 0$. For the purpose of studying the entangling
power of $U$, we may assume that all $D_j$ in (H1) are
one-dimensional projectors, since the input state is a product
state.

In the following we consider three cases. The first case
(I.1) is that $p = 0$, namely the blocks $V_2$ and $V_4$ do not exist in (H1).
We perform $U$ on the product vector $|e\rangle(|a\rangle + |b\rangle + |c\rangle)$,
where $|a\rangle$, $|b\rangle$, and $|c\rangle$ are respectively in the support of
$I_m$, $I_n$ and $I_q$ in (H1). If the resulting state is maximally
entangled, then the three states $|a\rangle + |b\rangle + |c\rangle$,
$|a\rangle + |b\rangle + V_1|c\rangle$, and $|a\rangle + V_3|b\rangle + |c\rangle$ are pairwise orthogonal.
Since $\mathrm{Re}\langle c|V_1|c\rangle \ge -\langle c|c\rangle$ and $\mathrm{Re}\langle b|V_3|b\rangle \ge -\langle b|b\rangle$,
the first two orthogonality conditions force $|a\rangle = 0$ and
$\langle b|b\rangle = \langle c|c\rangle$, and the third then forces $|b\rangle = |c\rangle = 0$.
So the only solution is $|a\rangle = |b\rangle = |c\rangle = 0$, which contradicts
the assumption that the resulting state is maximally entangled. Hence,
if ancillas are not allowed, then $U$ cannot create $\log_2 3$
ebits. The unitary $U$ with $m = p = 0$ and $n = q = 2$
can generate $\log_2 9 - 16/9$ ebits of entanglement starting
from a product state without ancillas. A corresponding
choice of such input state is $\frac{1}{\sqrt{3}}(1,1,1) \otimes (g, h, g, h)$, where
$g = \frac{\sqrt{3}+\sqrt{6}}{6}$, $h = \frac{\sqrt{3}-\sqrt{6}}{6}$. Numerical evidence suggests
that this number of $\log_2 9 - 16/9 \approx 1.392$ ebits is optimal
for this $U$, even when ancillas are allowed. Of course, if
$m > 0$ in the case above, we can still create the same
amount of entanglement by letting the input state have
zero amplitude in the support of the $I_m$. When $q$ or $n$ is
greater than 2, up to local permutations there is always
an $s \times s$ cyclic shift submatrix $V_{11}$ in $V_1$ and a $t \times t$ cyclic
shift submatrix $V_{31}$ in $V_3$, respectively. We ignore the $I_m$
and other parts of $V_1$ and $V_3$, which means the B-side
input state has zero amplitude in the support of those
matrices. Under these conventions, we choose the input
state to be of the form $\frac{1}{\sqrt{3}}(1,1,1) \otimes (v_1, v_2)$, where $v_1$
and $v_2$ are vectors of length $t$ and $s$, respectively, and
the elements in $v_1$ are just two real numbers appearing
alternately: $e, f, e, f, \ldots$, and thus the last number in $v_1$
is $e$ if $t$ is odd, and is $f$ if $t$ is even. Similarly the elements
in $v_2$ are just two real numbers appearing alternately:
$g, h, g, h, \ldots$, and thus the last number in $v_2$ is $g$ if $s$ is
odd, and is $h$ if $s$ is even. With suitable choices of real
numbers $e, f, g, h$, this would give rise to $\log_2 9 - 16/9 \approx
1.392$ ebits of entanglement in the output state. A class of
choices of the real 4-tuple $(e, f, g, h)$ for arbitrary $t, s \ge 2$
is given by $e - f = \sqrt{\frac{2}{3\lfloor t/2 \rfloor}}$, $g - h = \sqrt{\frac{2}{3\lfloor s/2 \rfloor}}$, and
$|v_1| = |v_2| = \frac{1}{\sqrt{2}}$. When these equations are satisfied,
the output reduced density operator on the A side would
be determined, and is the same as that corresponding to
the optimal output entangled state in the case $t = s = 2$.
It is not hard to see that there are two solutions for the
pair $(e, f)$ and two solutions for the pair $(g, h)$ for the
equations above, thus there are four solutions $(e, f, g, h)$
for these equations, for any $t$ and $s$. This shows that
$K_E(U) \ge \log_2 9 - 16/9 \approx 1.392$ ebits for all $U$ in case
(I.1).
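
The value $\log_2 9 - 16/9$ in case (I.1) can be checked numerically. The following Python sketch builds the three B-side vectors attached to $D_1$, $D_2$, $D_3$, forms the A-side reduced state from their Gram matrix, and evaluates its entropy. The amplitude conditions for general $t$, $s$ used here ($e - f = \sqrt{2/(3\lfloor t/2\rfloor)}$, $g - h = \sqrt{2/(3\lfloor s/2\rfloor)}$, $|v_1|^2 = |v_2|^2 = 1/2$) are an assumption of the sketch, obtained by requiring the same Gram matrix as in the $t = s = 2$ case; all function names are hypothetical.

```python
import math


def alternating(length, first, second):
    """Vector first, second, first, ... of the given length."""
    return [first if k % 2 == 0 else second for k in range(length)]


def cyclic_shift(vec):
    """Apply the cyclic shift |k> -> |k+1 mod len> to a list of amplitudes."""
    return vec[-1:] + vec[:-1]


def solve_pair(length, norm_sq, diff):
    """Solve x - y = diff with |alternating(length, x, y)|^2 = norm_sq (larger root)."""
    b = length // 2  # multiplicity of y; x appears length - b times
    # length*x^2 - 2*b*diff*x + b*diff^2 - norm_sq = 0
    disc = (b * diff) ** 2 - length * (b * diff ** 2 - norm_sq)
    x = (b * diff + math.sqrt(max(disc, 0.0))) / length
    return x, x - diff


def eigenvalues_sym3(m):
    """Eigenvalues of a real symmetric 3x3 matrix (trigonometric closed form)."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p2 = sum((m[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    if p2 == 0.0:
        return [q, q, q]
    p = math.sqrt(p2 / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def entropy_bits(probs):
    """Shannon entropy in bits, ignoring numerically zero entries."""
    return -sum(x * math.log2(x) for x in probs if x > 1e-12)


def case_I1_entanglement(t, s):
    """Output entanglement for input (1,1,1)/sqrt(3) x (v1, v2) in case (I.1)."""
    e, f = solve_pair(t, 0.5, math.sqrt(2.0 / (3 * (t // 2))))
    g, h = solve_pair(s, 0.5, math.sqrt(2.0 / (3 * (s // 2))))
    v1, v2 = alternating(t, e, f), alternating(s, g, h)
    # B-side vectors for D_1, D_2 (V_1 shifts v2) and D_3 (V_3 shifts v1)
    states = [v1 + v2, v1 + cyclic_shift(v2), cyclic_shift(v1) + v2]
    rho = [[sum(x * y for x, y in zip(a, b)) / 3.0 for b in states] for a in states]
    return entropy_bits(eigenvalues_sym3(rho)), (e, f, g, h)
```

For $t = s = 2$ the solver returns $e = g = (\sqrt{3}+\sqrt{6})/6$ and $f = h = (\sqrt{3}-\sqrt{6})/6$, and the reduced state has eigenvalues $4/9, 4/9, 1/9$, whose entropy is $\log_2 9 - 16/9$.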

The second case (I.2) is that $p > 0$ and $V_2 \neq V_4$. Then
both of $V_2$ and $V_4$ are nonzero. Up to local permutation
matrices on $\mathcal{H}_B$, we may assume

$$V_2 = I_s \oplus [V_{21}, V_{22}], \qquad V_4 = [V_{41}, V_{42}] \oplus I_t, \tag{H2}$$

with $s, t \ge 0$, where the submatrices $V_{21}$ and $V_{42}$ act on
the same subspace $\mathrm{span}\{|s+1\rangle, \cdots, |p-t\rangle\}$ of dimension
$p - s - t$. The moves in the paragraph including (H1)
imply that $p > s + t$. So $V_{21}$ and $V_{42}$ are both nonzero,
and are in the column vectors of the same position of $V_2$
and $V_4$. Furthermore $V_{21}$ and $V_{42}$ respectively have no
nonzero diagonal entries of $V_2$ and $V_4$. So
$\begin{pmatrix} 0 \\ V_{21} \end{pmatrix}|j\rangle \neq |j\rangle$ and
$\begin{pmatrix} V_{42} \\ 0 \end{pmatrix}|j\rangle \neq |j\rangle$ for all $j \in [s+1, p-t]$. Note
that $V_{21}$ and $V_{42}$ are both of full rank. If
$\begin{pmatrix} 0 \\ V_{21} \end{pmatrix}|j\rangle = \begin{pmatrix} V_{42} \\ 0 \end{pmatrix}|j\rangle$
for all $j \in [s+1, p-t]$, then
$V_{21} = \begin{pmatrix} X \\ 0 \end{pmatrix}$ and $V_{42} = \begin{pmatrix} 0 \\ X \end{pmatrix}$
with a permutation matrix $X$, and thus
from (H2) we obtain that $V_2$ and $V_4$ are both equal to $X$
up to the moves in the paragraph including (H1). This
is a contradiction with the assumption at the beginning
of this paragraph. Hence, we can find some $j \in [s+1, p-t]$
such that $\begin{pmatrix} 0 \\ V_{21} \end{pmatrix}|j\rangle \neq \begin{pmatrix} V_{42} \\ 0 \end{pmatrix}|j\rangle$.
It implies that $|j\rangle$, $V_2|j\rangle$ and $V_4|j\rangle$ are pairwise orthogonal.
Let $U$ act on the product state $\frac{1}{\sqrt{3}}(|a_1\rangle + |a_2\rangle + |a_3\rangle)|j\rangle$,
where the state $|a_k\rangle$ satisfies $D_j|a_k\rangle = \delta_{jk}|a_k\rangle$, i.e., $D_j$ is
the stabilizer of $|a_j\rangle$. So the resulting state
$\frac{1}{\sqrt{3}}(|a_1\rangle|j\rangle + |a_2\rangle V_2|j\rangle + |a_3\rangle V_4|j\rangle)$ is a Schmidt-rank-three maximally
entangled state, and we have created $\log_2 3$ ebits.
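
The last step of case (I.2) uses a computational-basis input on the B side, so the output entanglement depends only on where each B-side permutation sends $|j\rangle$. A minimal Python sketch, with permutations stored as lists (an encoding assumed here, with `perm[j]` the image of $|j\rangle$) and a hypothetical function name:

```python
import math
from collections import Counter


def output_entanglement(perms, j):
    """Entanglement in ebits from the input (sum_k |a_k>) |j> / sqrt(len(perms)).

    perms[k][j] is the basis state to which the k-th B-side permutation
    sends |j>. Terms with equal images merge into one Schmidt term, so the
    squared Schmidt coefficients are (multiplicity of image) / len(perms).
    """
    images = [perm[j] for perm in perms]
    weights = [count / len(perms) for count in Counter(images).values()]
    return -sum(w * math.log2(w) for w in weights)
```

With the identity, a cyclic shift and its inverse on three basis states, `output_entanglement([[0, 1, 2], [1, 2, 0], [2, 0, 1]], 0)` gives $\log_2 3$ because the three images are distinct; if two of the permutations agree on $|j\rangle$, the value drops below $\log_2 3$, which is why the argument needs a $j$ on which $V_2$ and $V_4$ differ.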


The third case (I.3) is that $p > 0$ and $V_2 = V_4$. So
we may assume that $V_2$ does not have nonzero diagonal
entries, and thus $p > 1$. Since $U$ has Schmidt rank
three, $n$ and $q$ are not simultaneously zero. If $n = 0$
or $q = 0$, by performing the local permutation matrix
$I_A \otimes (I_{m+n+q} \oplus V_2^\dagger)$ on the lhs of $U$, we obtain a new
unitary of the type of case (I.1). Thus we may assume
$n > 0$ and $q > 0$. Since $V_1$ and $V_3$ have no nonzero
diagonal entries, we have $n > 1$ and $q > 1$. Since
the identity matrix and any permutation matrix are
simultaneously diagonalizable, $U$ is locally equivalent to
a Schmidt-rank-three diagonal unitary. The unitary $U$
with $m = 0$ and $n = q = p = 2$ can generate exactly
$\log_2 3$ ebits of entanglement starting from a product state
without ancillas. An optimal choice of the input state is
$\frac{1}{\sqrt{3}}(1,1,1) \otimes (g, h, g, h, g, h)$, where $g = \frac{1+\sqrt{3}}{2\sqrt{6}}$, $h = \frac{1-\sqrt{3}}{2\sqrt{6}}$.

For generic cases in the case (I.3), we may assume $m = 0$
for the same reason as in case (I.1) above, and consider
$n, q, p$ to be integers not less than two. Up to local
permutation unitaries there is a cyclic shift (of length $t, s, u$
respectively) in each of the three permutation unitaries
$V_3$, $V_1$ and $V_2$, and we let the input state have nonzero
amplitude on the support of these operators only and
be of the form $\frac{1}{\sqrt{3}}(1,1,1) \otimes (v_1, v_2, v_3)$, where
$v_1, v_2, v_3$ are real vectors of length $t, s, u$, respectively.
The elements in $v_1$ are just two real numbers appearing
alternately: $e, f, e, f, \ldots$, and thus the last number
in $v_1$ is $e$ if $t$ is odd, and is $f$ if $t$ is even. Similarly,
$v_2 = (g, h, g, h, \ldots)$, and the last number in $v_2$ is $g$ if $s$
is odd, and is $h$ if $s$ is even. And $v_3 = (y, z, y, z, \ldots)$,
and the last number in $v_3$ is $y$ if $u$ is odd, and is $z$ if
$u$ is even. Then the maximal output entanglement of
$\log_2 3$ ebits is achievable, by choosing $e, f, g, h, y, z \in \mathbb{R}$
which satisfy

$$e - f = \frac{1}{\sqrt{2\lfloor t/2 \rfloor}}, \quad g - h = \frac{1}{\sqrt{2\lfloor s/2 \rfloor}}, \quad y - z = \frac{1}{\sqrt{2\lfloor u/2 \rfloor}}, \quad |v_1| = |v_2| = |v_3| = \frac{1}{\sqrt{3}}.$$

It is not hard to see that there are $2^3 = 8$ real solutions
$(e, f, g, h, y, z)$ to the equations above, for any $t, s, u$. And
since $K_E(U) \le \log_2 r$ ebits for any $U$ of Schmidt rank $r$,
we have that $K_E(U) = \log_2 3$ ebits for all $U$ in case (I.3).
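
The claim for case (I.3) can be checked by verifying that the three B-side output vectors are orthonormal, which makes the output a Schmidt-rank-three maximally entangled state. The Python sketch below assumes the amplitude conditions $e - f = 1/\sqrt{2\lfloor t/2\rfloor}$ (and likewise for the other two pairs) with $|v_i|^2 = 1/3$, and picks one root of the resulting quadratic; the function names are hypothetical.

```python
import math


def alternating(length, first, second):
    """Vector first, second, first, ... of the given length."""
    return [first if k % 2 == 0 else second for k in range(length)]


def cyclic_shift(vec):
    """Apply the cyclic shift |k> -> |k+1 mod len> to a list of amplitudes."""
    return vec[-1:] + vec[:-1]


def solve_pair(length, norm_sq, diff):
    """Solve x - y = diff with |alternating(length, x, y)|^2 = norm_sq (larger root)."""
    b = length // 2  # multiplicity of y; x appears length - b times
    # length*x^2 - 2*b*diff*x + b*diff^2 - norm_sq = 0
    disc = (b * diff) ** 2 - length * (b * diff ** 2 - norm_sq)
    x = (b * diff + math.sqrt(max(disc, 0.0))) / length
    return x, x - diff


def case_I3_gram(t, s, u):
    """Gram matrix of the B-side vectors attached to D_1, D_2, D_3 in case (I.3)."""
    vecs = []
    for length in (t, s, u):
        diff = 1.0 / math.sqrt(2 * (length // 2))
        x, y = solve_pair(length, 1.0 / 3.0, diff)
        vecs.append(alternating(length, x, y))
    v1, v2, v3 = vecs
    states = [
        v1 + v2 + v3,                              # D_1 term: identity
        v1 + cyclic_shift(v2) + cyclic_shift(v3),  # D_2 term: V_1 and V_2 act
        cyclic_shift(v1) + v2 + cyclic_shift(v3),  # D_3 term: V_3 and V_4 = V_2 act
    ]
    return [[sum(a * b for a, b in zip(p, q)) for q in states] for p in states]


def is_identity(matrix, tol=1e-9):
    """True if the matrix equals the identity up to tol."""
    n = len(matrix)
    return all(abs(matrix[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))
```

When `is_identity(case_I3_gram(t, s, u))` holds, the A-side reduced state is $I/3$ and the output carries $\log_2 3$ ebits; for $t = 2$ the solver returns the pair $\big((1+\sqrt{3})/(2\sqrt{6}),\ (1-\sqrt{3})/(2\sqrt{6})\big)$.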


Case (II). Suppose $U$ is a Schmidt-rank-three controlled
permutation unitary with four terms, and is controlled
from the A side. By following similar arguments
as in (I) but also noting that the B-side operators in
all four terms in $U$ are permutation matrices, it can be
shown that up to local permutation unitaries, the $U$ is of
the form

$$U = D_1 \otimes I_B + D_2 \otimes (I_m \oplus I_n \oplus V_1) + D_3 \otimes (I_m \oplus V_2 \oplus I_q) + D_4 \otimes (I_m \oplus V_2 \oplus V_1), \tag{H3}$$

where $D_j$ ($j = 1, \ldots, 4$) are orthogonal projectors onto
the computational basis states that add up to $I_A$, while
$V_1$ and $V_2$ are permutation matrices of size $q \times q$ and
$n \times n$, respectively, and their diagonal elements are all
zero. And $m \ge 0$ is an integer. Again, for the purpose
of studying the entangling power of $U$, we may assume
that all $D_j$ in (H3) are one-dimensional projectors.

When $q = n = 2$, the entangling power of $U$ is exactly
$\log_2 3$ ebits, and this number is achieved by a product
input state without ancillas. For example, when $m = 0$,
there is an input state of the form $\frac{1}{2}(1,1,1,1) \otimes (g, h, g, h)$
which gives the optimal output entanglement, where
$g = \frac{\sqrt{3}+\sqrt{6}}{6}$ and $h = \frac{\sqrt{3}-\sqrt{6}}{6}$ are the same numbers as in case
(I.1). When $m > 0$, we choose the B-side input state so
that it has zero amplitude in the support of $I_m$ in (H3);
then we are back to the $m = 0$ case. For other values of $q$
and $n$, and arbitrary $m \ge 0$ (which is treated as $m = 0$),
we also have that the entangling power of $U$ is exactly
$\log_2 3$ ebits. A class of the optimal input states is the
same as those in case (I.1), although it is possible that
there are other classes of optimal input states as well.
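
For $q = n = 2$ and $m = 0$ in case (II), the four B-side output vectors are linearly dependent, so the A-side reduced state has rank three. The Python sketch below checks that it has exactly three eigenvalues equal to $1/3$ by testing $\rho^2 = \rho/3$ together with $\mathrm{tr}\,\rho = 1$, which avoids an eigenvalue solver. The ordering of the B-side blocks ($n$-block first, then $q$-block) follows (H3) with $m = 0$; the function names are hypothetical.

```python
import math


def swap_pair(vec, i):
    """Swap entries i and i+1 (the 2x2 permutation with zero diagonal)."""
    out = list(vec)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def case_II_reduced_state():
    """A-side reduced state for the input (1,1,1,1)/2 x (g, h, g, h)."""
    g = (math.sqrt(3) + math.sqrt(6)) / 6
    h = (math.sqrt(3) - math.sqrt(6)) / 6
    psi = [g, h, g, h]  # first two entries: n-block, last two: q-block
    states = [
        psi,                              # D_1 term: identity
        swap_pair(psi, 2),                # D_2 term: V_1 on the q-block
        swap_pair(psi, 0),                # D_3 term: V_2 on the n-block
        swap_pair(swap_pair(psi, 0), 2),  # D_4 term: both
    ]
    return [[sum(a * b for a, b in zip(p, q)) / 4.0 for q in states] for p in states]


def is_flat_of_rank(rho, rank, tol=1e-12):
    """True if tr(rho) = 1 and rho*rho = rho/rank, i.e. `rank` eigenvalues equal 1/rank."""
    n = len(rho)
    trace_ok = abs(sum(rho[i][i] for i in range(n)) - 1.0) < tol
    square_ok = all(
        abs(sum(rho[i][k] * rho[k][j] for k in range(n)) - rho[i][j] / rank) < tol
        for i in range(n) for j in range(n)
    )
    return trace_ok and square_ok
```

`is_flat_of_rank(case_II_reduced_state(), 3)` holding means the output entanglement is $\log_2 3$ ebits, which matches the upper bound for a Schmidt-rank-three unitary.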

Case (III). Now the only remaining case is that $U$ is
of the form of the last case in Lemma [TCT il. An example
of this case is in (25). When no ancillas are allowed, the
$U$ in (25) can generate at most 1 ebit, since it is on a
$3 \times 2$ dimensional system. When ancillas are allowed, we
choose the ancillas $A'$ and $B'$ to be of the same size as the
input systems $A$ and $B$, respectively, and let the input
state on the two sides be the maximally entangled states
$\frac{1}{\sqrt{3}}\sum_{j=1}^{3} |jj\rangle_{AA'}$ and $\frac{1}{\sqrt{2}}\sum_{k=1}^{2} |kk\rangle_{BB'}$, respectively; then the
output state contains exactly $\log_2 3$ ebits. For other unitaries
$U$ of the type of case (III), up to local permutations
and a swap of the two systems we may write $U$ as

$$U = (P_A \otimes V_B) \oplus_A [(I_A - P_A) \otimes Q_B + W_A \otimes (I_B - Q_B)], \tag{H4}$$

where $P_A$ and $Q_B$ are projectors onto computational basis
states of $\mathcal{H}_A$ and $\mathcal{H}_B$, respectively, and $W_A$ is a partial
permutation matrix which is of full rank in the support of
$I_A - P_A$, and $V_B$ is a permutation matrix. We choose the
input state on the A side to be of the form $\sum_{j=1}^{d_A} \mu_j |jj\rangle_{AA'}$,
where the real coefficients $\mu_j$ take at most three different
values including zero, and $\mu_j = 0$ iff $\langle j|W_A|j\rangle \neq 0$. The
nonzero values of $\mu_j$ are the same for $|j\rangle_A$ in the support
of $P_A$. And the same statement holds for the support of
$I_A - P_A$. And choose the input state on the B side to
be $\sum_{k=1}^{d_B} \nu_k |kk\rangle_{BB'}$, where the real coefficients $\nu_k$ take
at most three different values including zero, and $\nu_k = 0$
iff $\langle k|V_B|k\rangle \neq 0$. The nonzero values of $\nu_k$ are the same
for $|k\rangle_B$ in the support of $Q_B$. And the same statement
holds for the support of $I_B - Q_B$. With a suitable choice
of the $\mu_j$ and $\nu_k$ subject to the constraints above, the
output entanglement is exactly $\log_2 3$ ebits. Therefore,
the entangling power of $U$ in case (III) is always $\log_2 3$
ebits.

In summary, we have considered all forms of U, and 
thus the assertion holds. □ 